Introduction
This file contains essential commands from the chapters of r4ds and corresponding examples. A command is considered “essential” when you really need to know it and need to know how to use it to succeed in this course.
All ds4psy essentials:
| Nr. | Topic |
|---|---|
| 1. | Creating and using tibbles |
| 2. | Data transformation |
| 3. | Visualizing data |
Course coordinates
- Course Data Science for Psychologists (ds4psy).
- Taught at the University of Konstanz by Hansjörg Neth (h.neth@uni.kn, SPDS, office D507).
- Spring/summer 2018: Mondays, 13:30–15:00, C511.
- Links to ZeUS and Ilias
Preparations
Create an R script (.R) or an R-Markdown file (.Rmd) and load the R packages of the tidyverse. (Hint: Structure your script by inserting spaces, meaningful comments, and sections.)
## Essential commands | Data science for psychologists
## 2018 06 24
## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ##
## Preparations: -----
library(tidyverse)
## Topic: -----
# ...
## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ##
## End of file. -----
Tibbles
Whenever working with rectangular data structures – data consisting of multiple cases (rows) and variables (columns) – our first step is to create or transform the data into a tibble (i.e., a simple version of a data frame).
Creating tibbles
Basic commands
There are 3 basic commands for creating tibbles:
- as_tibble converts (or coerces) an existing data frame into a tibble.
- tibble converts several vectors into (the columns of) a tibble.
- tribble converts a table (entered row-by-row) into a tibble.
Check: The 3 commands yield the same type of output (i.e., a tibble), but require different inputs. Ask yourself which kind of input each command takes and how this input needs to be structured and formatted (e.g., with commas).
1. as_tibble
Use as_tibble when the data to be used already is in a data frame (or matrix):
## Using the data frame `sleep`: ------
# ?datasets::sleep # provides background information on the data set.
# Save the sleep data frame as df:
df <- datasets::sleep
# Convert df into a tibble tb:
tb <- as_tibble(df)
# Inspect the data frame df:
dim(df)
#> [1] 20 3
is.data.frame(df)
#> [1] TRUE
head(df)
#> extra group ID
#> 1 0.7 1 1
#> 2 -1.6 1 2
#> 3 -0.2 1 3
#> 4 -1.2 1 4
#> 5 -0.1 1 5
#> 6 3.4 1 6
str(df)
#> 'data.frame': 20 obs. of 3 variables:
#> $ extra: num 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
#> $ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
#> $ ID : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
# Inspect the tibble tb:
dim(tb)
#> [1] 20 3
is_tibble(tb)
#> [1] TRUE
is.data.frame(tb) # => tibbles ARE data frames.
#> [1] TRUE
head(tb)
#> # A tibble: 6 x 3
#> extra group ID
#> <dbl> <fctr> <fctr>
#> 1 0.7 1 1
#> 2 -1.6 1 2
#> 3 -0.2 1 3
#> 4 -1.2 1 4
#> 5 -0.1 1 5
#> 6 3.4 1 6
glimpse(tb)
#> Observations: 20
#> Variables: 3
#> $ extra <dbl> 0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0, 1....
#> $ group <fctr> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
#> $ ID <fctr> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, ...
Practice: Convert the data frames datasets::attitude and datasets::iris into tibbles and inspect their dimensions and contents. What types of variables do they contain?
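Since as_tibble also accepts a matrix, the following minimal sketch converts a small matrix into a tibble. The matrix m and its column names x and y are made up for illustration and are not part of the course data; naming the columns up front avoids relying on automatically generated names.

```r
library(tibble)

# Hypothetical 3 x 2 matrix with named columns (illustration only):
m <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("x", "y")))

# Convert the matrix into a tibble:
tb_m <- as_tibble(m)

dim(tb_m)        # dimensions carry over from the matrix
names(tb_m)      # column names carry over as variable names
is_tibble(tb_m)  # checks that the result is a tibble
```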
2. tibble
Use tibble when the data to be used appears as a collection of columns. For instance, imagine we have the following information about a family:
| id | name | age | gender | drives | married_2 |
|---|---|---|---|---|---|
| 1 | Adam | 46 | male | TRUE | Eva |
| 2 | Eva | 48 | female | TRUE | Adam |
| 3 | Xaxi | 21 | female | FALSE | Zenon |
| 4 | Yota | 19 | female | TRUE | NA |
| 5 | Zack | 17 | male | FALSE | NA |
One way of viewing this table is as a series of columns. Each column consists of a variable name and the same number of (here: 5) values, which can be of different types (here: numbers, characters, or Boolean truth values). Each column may or may not contain missing values (entered as NA).
The tibble command expects that each column of the table is entered as a vector:
## Create a tibble from vectors (column-by-column):
fm <- tibble(
id = c(1, 2, 3, 4, 5), # OR: id = 1:5,
name = c("Adam", "Eva", "Xaxi", "Yota", "Zack"),
age = c(46, 48, 21, 19, 17),
gender = c("male", rep("female", 3), "male"),
drives = c(TRUE, TRUE, FALSE, TRUE, FALSE),
married_2 = c("Eva", "Adam", "Zenon", NA, NA)
)
fm # prints the tibble:
#> # A tibble: 5 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 3 Xaxi 21 female FALSE Zenon
#> 4 4 Yota 19 female TRUE <NA>
#> 5 5 Zack 17 male FALSE <NA>
Note some details:
- Each vector is labeled by the variable (column) name, which is not put into quotes;
- Avoid spaces within variable (column) names (or enclose names in single quotes if you really must use spaces);
- All vectors need to have the same length;
- Each vector is of a single type (numeric, character, or Boolean truth values);
- Consecutive vectors are separated by commas (but there is no comma after the final vector).
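Two of these details can be tried out directly. The sketch below uses made-up objects (tb_names, res) and assumes a current version of the tibble package: a column name containing a space is enclosed in backticks here, a vector of length 1 is the one exception to the equal-length rule (it is recycled), and vectors of unequal lengths greater than 1 cause an error.

```r
library(tibble)

# Hypothetical example: a name with a space needs backticks,
# and the length-1 vector "A" is recycled to both rows:
tb_names <- tibble(
  `first name` = c("Adam", "Eva"),
  family = "A"
)

tb_names$`first name`  # backticks are also needed when accessing the column

# Vectors of different lengths (here: 2 vs. 3) are rejected;
# tryCatch() turns the error into the string "error":
res <- tryCatch(
  tibble(a = 1:2, b = 1:3),
  error = function(e) "error"
)
res
```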
When using tibble, later vectors may use the values of earlier vectors:
# Using earlier vectors when defining later ones:
abc <- tibble(
ltr = LETTERS[1:5],
num = 1:5,
l_n = paste(ltr, num, sep = "_"), # combining ltr with num
nsq = num^2 # squaring num
)
abc # prints the tibble:
#> # A tibble: 5 x 4
#> ltr num l_n nsq
#> <chr> <int> <chr> <dbl>
#> 1 A 1 A_1 1
#> 2 B 2 B_2 4
#> 3 C 3 C_3 9
#> 4 D 4 D_4 16
#> 5 E 5 E_5 25
Practice: Find some tabular data online (e.g., on Wikipedia) and enter it as a tibble.
3. tribble
Use tribble when the data to be used appears as a collection of rows (or already is in tabular form).
For instance, when you copy and paste the above family data from an electronic document, it is easy to insert commas between consecutive cell values and use tribble to convert it into a tibble:
## Create a tibble from tabular data (row-by-row):
fm2 <- tribble(
~id, ~name, ~age, ~gender, ~drives, ~married_2,
#--|------|-----|--------|----------|----------|
1, "Adam", 46, "male", TRUE, "Eva",
2, "Eva", 48, "female", TRUE, "Adam",
3, "Xaxi", 21, "female", FALSE, "Zenon",
4, "Yota", 19, "female", TRUE, NA,
5, "Zack", 17, "male", FALSE, NA )
fm2 # prints the tibble:
#> # A tibble: 5 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 3 Xaxi 21 female FALSE Zenon
#> 4 4 Yota 19 female TRUE <NA>
#> 5 5 Zack 17 male FALSE <NA>
Note some details:
- The column names are preceded by ~;
- Consecutive entries are separated by a comma (but there is no comma after the final entry);
- The line #--|------|-----|--------|----------|----------| is commented out and can be omitted;
- The type of each column is determined by the type of the corresponding cell values. For instance, the NA values in fm2 are missing character values because the entries above were characters (entered in quotes).
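The last point can be checked with a small sketch. The tibbles t_lgl and t_chr are hypothetical: when a column contains nothing but NA, there is no other cell value to determine its type, so it should end up as a logical column unless a typed missing value such as NA_character_ is used.

```r
library(tibble)

# Hypothetical example: column y contains only NA values,
# so no other entry determines its type:
t_lgl <- tribble(
  ~x,  ~y,
  "a", NA,
  "b", NA
)
class(t_lgl$y)  # type of an all-NA column

# Using the typed missing value NA_character_ instead:
t_chr <- tribble(
  ~x,  ~y,
  "a", NA_character_,
  "b", NA_character_
)
class(t_chr$y)  # type when the missing values are typed
```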
Check: If tibble and tribble really are alternative commands, then the contents of our objects fm and fm2 should be identical:
# Are fm and fm2 equal?
all.equal(fm, fm2)
#> [1] TRUE
Practice: Enter the tibble abc by using tribble.
Accessing parts of a tibble
Once we have an R object that is a tibble, we often want to access individual parts of it. We can distinguish between 3 simple cases:
1. Variables (columns)
As each column of a tibble is a vector, obtaining a column amounts to obtaining the corresponding vector. We can access this vector by its name (label) or by its number (column position):
fm # family tibble (defined above):
#> # A tibble: 5 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 3 Xaxi 21 female FALSE Zenon
#> 4 4 Yota 19 female TRUE <NA>
#> 5 5 Zack 17 male FALSE <NA>
# Get the name column of fm:
fm$name # by label (with $)
#> [1] "Adam" "Eva" "Xaxi" "Yota" "Zack"
fm[["name"]] # by label (with [[ ]])
#> [1] "Adam" "Eva" "Xaxi" "Yota" "Zack"
fm[[2]] # by number (with [[ ]])
#> [1] "Adam" "Eva" "Xaxi" "Yota" "Zack"
# Get the age column of fm:
fm$age # by name (with $)
#> [1] 46 48 21 19 17
fm[["age"]] # by name (with [[ ]])
#> [1] 46 48 21 19 17
fm[[3]] # by number (with [[ ]])
#> [1] 46 48 21 19 17
# Note: The following all yield the same vectors as a tibble:
fm[ , 2] # yields the name vector as a (5 x 1) tibble
#> # A tibble: 5 x 1
#> name
#> <chr>
#> 1 Adam
#> 2 Eva
#> 3 Xaxi
#> 4 Yota
#> 5 Zack
select(fm, 2)
#> # A tibble: 5 x 1
#> name
#> <chr>
#> 1 Adam
#> 2 Eva
#> 3 Xaxi
#> 4 Yota
#> 5 Zack
select(fm, name)
#> # A tibble: 5 x 1
#> name
#> <chr>
#> 1 Adam
#> 2 Eva
#> 3 Xaxi
#> 4 Yota
#> 5 Zack
fm[ , 3] # yields the age vector as a (5 x 1) tibble
#> # A tibble: 5 x 1
#> age
#> <dbl>
#> 1 46
#> 2 48
#> 3 21
#> 4 19
#> 5 17
select(fm, 3)
#> # A tibble: 5 x 1
#> age
#> <dbl>
#> 1 46
#> 2 48
#> 3 21
#> 4 19
#> 5 17
select(fm, age)
#> # A tibble: 5 x 1
#> age
#> <dbl>
#> 1 46
#> 2 48
#> 3 21
#> 4 19
#> 5 17
Practice: Extract the price column of ggplot2::diamonds in at least 3 different ways and verify that they all yield the same mean price.
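The difference between getting a column as a vector and getting it as a one-column tibble matters as soon as the result is passed on to a function that expects a vector. As a minimal sketch, dplyr::pull is a further, pipe-friendly way of extracting a column as a vector. The shortened tibble fm_small below is an assumption made to keep the example self-contained.

```r
library(dplyr)

# Shortened, hypothetical version of the family tibble:
fm_small <- tibble(
  name = c("Adam", "Eva", "Xaxi"),
  age  = c(46, 48, 21)
)

age_vec <- pull(fm_small, age)    # column as a plain numeric vector
age_tbl <- select(fm_small, age)  # column as a (3 x 1) tibble

is.numeric(age_vec)     # a vector can be passed on to mean(), sum(), etc.
is.data.frame(age_tbl)  # select() keeps the tibble structure

# The same vector, obtained within a pipe:
fm_small %>% pull(age) %>% mean()
```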
2. Cases (rows)
Extracting specific rows of a tibble amounts to filtering a tibble and typically yields smaller tibbles (as a row may contain entries of different types). The best way of filtering specific rows of a tibble is using dplyr::filter. However, it’s also possible to specify the desired rows by subsetting (i.e., specifying a condition that results in a Boolean value) and by row number:
fm # family tibble (defined above):
#> # A tibble: 5 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 3 Xaxi 21 female FALSE Zenon
#> 4 4 Yota 19 female TRUE <NA>
#> 5 5 Zack 17 male FALSE <NA>
# Filter specific rows (by condition):
filter(fm, id > 2)
#> # A tibble: 3 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 3 Xaxi 21 female FALSE Zenon
#> 2 4 Yota 19 female TRUE <NA>
#> 3 5 Zack 17 male FALSE <NA>
filter(fm, age < 18)
#> # A tibble: 1 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 5 Zack 17 male FALSE <NA>
fm %>% filter(drives == TRUE)
#> # A tibble: 3 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 4 Yota 19 female TRUE <NA>
# The same filters by using Boolean vectors (subsetting):
fm[fm$id > 2, ]
#> # A tibble: 3 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 3 Xaxi 21 female FALSE Zenon
#> 2 4 Yota 19 female TRUE <NA>
#> 3 5 Zack 17 male FALSE <NA>
fm[fm$age < 18, ]
#> # A tibble: 1 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 5 Zack 17 male FALSE <NA>
fm[fm$drives == TRUE, ]
#> # A tibble: 3 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 4 Yota 19 female TRUE <NA>
# The same filters by providing specific row numbers:
fm[3:5, ] # getting rows 3 to 5 of fm
#> # A tibble: 3 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 3 Xaxi 21 female FALSE Zenon
#> 2 4 Yota 19 female TRUE <NA>
#> 3 5 Zack 17 male FALSE <NA>
fm[5, ] # getting row 5 of fm
#> # A tibble: 1 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 5 Zack 17 male FALSE <NA>
fm[c(1, 2, 4), ] # getting rows 1, 2, and 4 of fm
#> # A tibble: 3 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 4 Yota 19 female TRUE <NA>
Practice: Extract all diamonds from ggplot2::diamonds that have at least 2 carat. How many of them are there and what is their mean price?
3. Cells
Accessing the values of individual tibble cells is relatively rare, but can be achieved by
a. explicitly providing both row number `r` and column number `c` (as `[r, c]`), or by
b. first extracting the column (as a vector `v`) and then providing the desired row number `r` (`v[r]`).
fm # family tibble (defined above):
#> # A tibble: 5 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 3 Xaxi 21 female FALSE Zenon
#> 4 4 Yota 19 female TRUE <NA>
#> 5 5 Zack 17 male FALSE <NA>
# Getting specific cell values:
fm$name[4] # getting the name of the 4th row
#> [1] "Yota"
fm[4, 2] # getting the same name by row and column numbers
#> # A tibble: 1 x 1
#> name
#> <chr>
#> 1 Yota
# Note: What if we don't know the row number?
which(fm$name == "Yota") # getting the row number that contains the name "Yota"
#> [1] 4
In practice, accessing individual cell values is mostly needed to check for specific cell values and to change or correct erroneous entries by re-assigning them to a different value.
# Checking and changing cell values:
# Check: "Who is Xaxi's spouse?" in 3 different ways:
fm[fm$name == "Xaxi", ]$married_2
#> [1] "Zenon"
fm$married_2[3]
#> [1] "Zenon"
fm[3, 6]
#> # A tibble: 1 x 1
#> married_2
#> <chr>
#> 1 Zenon
# Change: "Zenon" is actually "Zeus" in 3 different ways:
fm[fm$name == "Xaxi", ]$married_2 <- "Zeus"
fm$married_2[3] <- "Zeus"
fm[3, 6] <- "Zeus"
# Check for successful change:
fm
#> # A tibble: 5 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 3 Xaxi 21 female FALSE Zeus
#> 4 4 Yota 19 female TRUE <NA>
#> 5 5 Zack 17 male FALSE <NA>
By contrast, a relatively common task is to check an entire tibble for missing values, count them, or replace them by some other value:
# Checking for, counting, and changing missing values:
fm # family tibble (defined above):
#> # A tibble: 5 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 3 Xaxi 21 female FALSE Zeus
#> 4 4 Yota 19 female TRUE <NA>
#> 5 5 Zack 17 male FALSE <NA>
# (a) Check for missing values:
is.na(fm) # checks each cell value for being NA
#> id name age gender drives married_2
#> [1,] FALSE FALSE FALSE FALSE FALSE FALSE
#> [2,] FALSE FALSE FALSE FALSE FALSE FALSE
#> [3,] FALSE FALSE FALSE FALSE FALSE FALSE
#> [4,] FALSE FALSE FALSE FALSE FALSE TRUE
#> [5,] FALSE FALSE FALSE FALSE FALSE TRUE
# (b) Count the number of missing values:
sum(is.na(fm)) # counts missing values (by adding up all TRUE values)
#> [1] 2
# (c) Change all missing values:
fm[is.na(fm)] <- "A MISSING value!"
# Check for successful change:
fm
#> # A tibble: 5 x 6
#> id name age gender drives married_2
#> <dbl> <chr> <dbl> <chr> <lgl> <chr>
#> 1 1 Adam 46 male TRUE Eva
#> 2 2 Eva 48 female TRUE Adam
#> 3 3 Xaxi 21 female FALSE Zeus
#> 4 4 Yota 19 female TRUE A MISSING value!
#> 5 5 Zack 17 male FALSE A MISSING value!
Practice: Determine the number and the percentage of missing values in the datasets dplyr::starwars and dplyr::storms.
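Missing values can also be counted per column, and replaced in one column at a time. The sketch below is a hypothetical variation of the family data (fm_na, with an additional missing age that is not in the original table). Replacing NA by a text label is done only in the character column here, as a text label does not fit into a numeric column such as age.

```r
library(tibble)

# Hypothetical variation of the family data with 3 missing values:
fm_na <- tibble(
  name      = c("Adam", "Eva", "Xaxi", "Yota", "Zack"),
  age       = c(46, 48, 21, NA, 17),
  married_2 = c("Eva", "Adam", "Zenon", NA, NA)
)

# Count missing values per column (column sums of the logical matrix):
na_per_col <- colSums(is.na(fm_na))
na_per_col

# Proportion of missing values among all cells (mean of TRUE/FALSE values):
na_share <- mean(is.na(fm_na))
na_share

# Replace missing values in the character column only:
fm_na$married_2[is.na(fm_na$married_2)] <- "unknown"
sum(is.na(fm_na))  # the missing age remains
```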
More advanced operations on tibbles are covered in Chapter 5: Data transformation and involve using the dplyr commands arrange, filter, and select.
More on tibbles
For more details on tibbles,
- study vignette("tibble") and the documentation for ?tibble;
- study https://tibble.tidyverse.org/ and its examples;
- read Chapter 10: Tibbles and complete its exercises.
Data transformation
Overview
When we have data in the form of a tibble or data frame, dplyr provides a range of simple tools to transform this data. Six essential dplyr commands are:
- arrange sorts cases (rows);
- filter selects cases (rows) by logical conditions;
- select selects and reorders variables (columns);
- mutate computes new variables (columns) and adds them to existing ones;
- summarise collapses multiple values of a variable (rows of a column) to a single one;
- group_by changes the unit of aggregation (in combination with mutate and summarise).
Not quite as essential but still useful dplyr commands include:
- slice selects (ranges of) cases (rows) by number;
- rename renames variables (columns) and keeps others;
- transmute computes new variables (columns) and drops existing ones;
- sample_n and sample_frac draw random samples of cases (rows).
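A brief sketch of these additional commands, using the dplyr::starwars data. The object names (sw_first3, sw_renamed, sw_heights, sw_sample), the new column names person and height_m, and the seed value are arbitrary choices for illustration.

```r
library(dplyr)

sw <- dplyr::starwars

# slice: select rows by number (here: the first 3 rows):
sw_first3 <- slice(sw, 1:3)

# rename: new_name = old_name, all other variables are kept:
sw_renamed <- rename(sw, person = name)

# transmute: keep name, compute height in meters, drop everything else:
sw_heights <- transmute(sw, name, height_m = height / 100)

# sample_n: draw 5 random rows (set.seed makes the draw reproducible):
set.seed(123)
sw_sample <- sample_n(sw, 5)

dim(sw_first3)
names(sw_heights)
```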
Commands and examples
We save the dplyr::starwars data as a tibble sw and use it to illustrate the essential dplyr commands.
library(tidyverse)
sw <- dplyr::starwars
sw # => A tibble: 87 rows (individuals) x 13 columns (variables)
#> # A tibble: 87 x 13
#> name height mass hair_color skin_color eye_color
#> <chr> <int> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 blond fair blue
#> 2 C-3PO 167 75 <NA> gold yellow
#> 3 R2-D2 96 32 <NA> white, blue red
#> 4 Darth Vader 202 136 none white yellow
#> 5 Leia Organa 150 49 brown light brown
#> 6 Owen Lars 178 120 brown, grey light blue
#> 7 Beru Whitesun lars 165 75 brown light blue
#> 8 R5-D4 97 32 <NA> white, red red
#> 9 Biggs Darklighter 183 84 black light brown
#> 10 Obi-Wan Kenobi 182 77 auburn, white fair blue-gray
#> # ... with 77 more rows, and 7 more variables: birth_year <dbl>,
#> # gender <chr>, homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
Practice: How many sw variables (columns) are there and of which type are they? How many missing (NA) values are there?
1. arrange to sort rows
Using arrange sorts cases (rows) by putting specific variables (columns) in specific orders (e.g., ascending or descending):
# Sort rows alphabetically (by name):
arrange(sw, name)
#> # A tibble: 87 x 13
#> name height mass hair_color skin_color
#> <chr> <int> <dbl> <chr> <chr>
#> 1 Ackbar 180 83 none brown mottle
#> 2 Adi Gallia 184 50 none dark
#> 3 Anakin Skywalker 188 84 blond fair
#> 4 Arvel Crynyd NA NA brown fair
#> 5 Ayla Secura 178 55 none blue
#> 6 Bail Prestor Organa 191 NA black tan
#> 7 Barriss Offee 166 50 black yellow
#> 8 BB8 NA NA none none
#> 9 Ben Quadinaros 163 65 none grey, green, yellow
#> 10 Beru Whitesun lars 165 75 brown light
#> # ... with 77 more rows, and 8 more variables: eye_color <chr>,
#> # birth_year <dbl>, gender <chr>, homeworld <chr>, species <chr>,
#> # films <list>, vehicles <list>, starships <list>
# The same command using the pipe:
sw %>% # Note: %>% is NOT + (used in ggplot)
arrange(name)
#> # A tibble: 87 x 13
#> name height mass hair_color skin_color
#> <chr> <int> <dbl> <chr> <chr>
#> 1 Ackbar 180 83 none brown mottle
#> 2 Adi Gallia 184 50 none dark
#> 3 Anakin Skywalker 188 84 blond fair
#> 4 Arvel Crynyd NA NA brown fair
#> 5 Ayla Secura 178 55 none blue
#> 6 Bail Prestor Organa 191 NA black tan
#> 7 Barriss Offee 166 50 black yellow
#> 8 BB8 NA NA none none
#> 9 Ben Quadinaros 163 65 none grey, green, yellow
#> 10 Beru Whitesun lars 165 75 brown light
#> # ... with 77 more rows, and 8 more variables: eye_color <chr>,
#> # birth_year <dbl>, gender <chr>, homeworld <chr>, species <chr>,
#> # films <list>, vehicles <list>, starships <list>
# Sort rows in descending order:
sw %>%
arrange(desc(name))
#> # A tibble: 87 x 13
#> name height mass hair_color skin_color
#> <chr> <int> <dbl> <chr> <chr>
#> 1 Zam Wesell 168 55 blonde fair, green, yellow
#> 2 Yoda 66 17 white green
#> 3 Yarael Poof 264 NA none white
#> 4 Wilhuff Tarkin 180 NA auburn, grey fair
#> 5 Wicket Systri Warrick 88 20 brown brown
#> 6 Wedge Antilles 170 77 brown fair
#> 7 Watto 137 NA black blue, grey
#> 8 Wat Tambor 193 48 none green, grey
#> 9 Tion Medon 206 80 none grey
#> 10 Taun We 213 NA none grey
#> # ... with 77 more rows, and 8 more variables: eye_color <chr>,
#> # birth_year <dbl>, gender <chr>, homeworld <chr>, species <chr>,
#> # films <list>, vehicles <list>, starships <list>
# Sort by multiple variables:
sw %>%
arrange(eye_color, gender, desc(height))
#> # A tibble: 87 x 13
#> name height mass hair_color skin_color eye_color
#> <chr> <int> <dbl> <chr> <chr> <chr>
#> 1 Taun We 213 NA none grey black
#> 2 Shaak Ti 178 57 none red, blue, white black
#> 3 Lama Su 229 88 none grey black
#> 4 Tion Medon 206 80 none grey black
#> 5 Kit Fisto 196 87 none green black
#> 6 Plo Koon 188 80 none orange black
#> 7 Greedo 173 74 <NA> green black
#> 8 Nien Nunb 160 68 none grey black
#> 9 Gasgano 122 NA none white, blue black
#> 10 BB8 NA NA none none black
#> # ... with 77 more rows, and 7 more variables: birth_year <dbl>,
#> # gender <chr>, homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
## Note: See
# ?dplyr::arrange # for more help and examples.
Note some details:
- All basic dplyr commands can be called as verb(data, ...) or, using the pipe from magrittr, as data %>% verb(...) (see vignette("magrittr") for details).
- Variable names are unquoted.
- The order of variable names (x, y, ...) specifies the order or priority of operations (first by x, then by y, etc.).
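Both points can be verified with a short sketch. The object names (by_call, by_pipe, eye_first, height_first) are made up for illustration: the two calling conventions should yield identical results, whereas swapping the order of the sorting variables changes the order of rows.

```r
library(dplyr)

sw <- dplyr::starwars

# (a) verb(data, ...) vs. data %>% verb(...):
by_call <- arrange(sw, eye_color, desc(height))
by_pipe <- sw %>% arrange(eye_color, desc(height))
identical(by_call, by_pipe)  # same result either way

# (b) The order of variables sets their priority:
eye_first    <- sw %>% arrange(eye_color, height)  # first by eye_color, then height
height_first <- sw %>% arrange(height, eye_color)  # first by height, then eye_color
identical(eye_first$name, height_first$name)       # the row orders differ
```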
Practice: Arrange the sw data in different ways, combining multiple variables and (ascending and descending) orders. Where are cases containing NA values in sorted variables placed?
2. filter to select rows
Using filter selects cases (rows) by logical conditions. It keeps all rows for which the conditions are TRUE and drops all rows for which the conditions are FALSE or NA.
# Filter to keep all humans:
filter(sw, species == "Human")
#> # A tibble: 35 x 13
#> name height mass hair_color skin_color eye_color
#> <chr> <int> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 blond fair blue
#> 2 Darth Vader 202 136 none white yellow
#> 3 Leia Organa 150 49 brown light brown
#> 4 Owen Lars 178 120 brown, grey light blue
#> 5 Beru Whitesun lars 165 75 brown light blue
#> 6 Biggs Darklighter 183 84 black light brown
#> 7 Obi-Wan Kenobi 182 77 auburn, white fair blue-gray
#> 8 Anakin Skywalker 188 84 blond fair blue
#> 9 Wilhuff Tarkin 180 NA auburn, grey fair blue
#> 10 Han Solo 180 80 brown fair brown
#> # ... with 25 more rows, and 7 more variables: birth_year <dbl>,
#> # gender <chr>, homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
# The same command using the pipe:
sw %>% # Note: %>% is NOT + (used in ggplot)
filter(species == "Human")
#> # A tibble: 35 x 13
#> name height mass hair_color skin_color eye_color
#> <chr> <int> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 blond fair blue
#> 2 Darth Vader 202 136 none white yellow
#> 3 Leia Organa 150 49 brown light brown
#> 4 Owen Lars 178 120 brown, grey light blue
#> 5 Beru Whitesun lars 165 75 brown light blue
#> 6 Biggs Darklighter 183 84 black light brown
#> 7 Obi-Wan Kenobi 182 77 auburn, white fair blue-gray
#> 8 Anakin Skywalker 188 84 blond fair blue
#> 9 Wilhuff Tarkin 180 NA auburn, grey fair blue
#> 10 Han Solo 180 80 brown fair brown
#> # ... with 25 more rows, and 7 more variables: birth_year <dbl>,
#> # gender <chr>, homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
# Filter by multiple (additive) conditions:
sw %>%
filter(height > 180, mass <= 75) # tall and light individuals
#> # A tibble: 3 x 13
#> name height mass hair_color skin_color eye_color birth_year
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl>
#> 1 Jar Jar Binks 196 66 none orange orange 52
#> 2 Adi Gallia 184 50 none dark blue NA
#> 3 Wat Tambor 193 48 none green, grey unknown NA
#> # ... with 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#> # films <list>, vehicles <list>, starships <list>
# The same command using the logical operator (&):
sw %>%
filter(height > 180 & mass <= 75) # tall and light individuals
#> # A tibble: 3 x 13
#> name height mass hair_color skin_color eye_color birth_year
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl>
#> 1 Jar Jar Binks 196 66 none orange orange 52
#> 2 Adi Gallia 184 50 none dark blue NA
#> 3 Wat Tambor 193 48 none green, grey unknown NA
#> # ... with 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#> # films <list>, vehicles <list>, starships <list>
# Filter for a range of a specific variable:
sw %>%
filter(height >= 150, height <= 165) # (a) using height twice
#> # A tibble: 9 x 13
#> name height mass hair_color skin_color eye_color
#> <chr> <int> <dbl> <chr> <chr> <chr>
#> 1 Leia Organa 150 49 brown light brown
#> 2 Beru Whitesun lars 165 75 brown light blue
#> 3 Mon Mothma 150 NA auburn fair blue
#> 4 Nien Nunb 160 68 none grey black
#> 5 Shmi Skywalker 163 NA black fair brown
#> 6 Ben Quadinaros 163 65 none grey, green, yellow orange
#> 7 Cordé 157 NA brown light brown
#> 8 Dormé 165 NA brown light brown
#> 9 Padmé Amidala 165 45 brown light brown
#> # ... with 7 more variables: birth_year <dbl>, gender <chr>,
#> # homeworld <chr>, species <chr>, films <list>, vehicles <list>,
#> # starships <list>
sw %>%
filter(between(height, 150, 165)) # (b) using between(...)
#> # A tibble: 9 x 13
#> name height mass hair_color skin_color eye_color
#> <chr> <int> <dbl> <chr> <chr> <chr>
#> 1 Leia Organa 150 49 brown light brown
#> 2 Beru Whitesun lars 165 75 brown light blue
#> 3 Mon Mothma 150 NA auburn fair blue
#> 4 Nien Nunb 160 68 none grey black
#> 5 Shmi Skywalker 163 NA black fair brown
#> 6 Ben Quadinaros 163 65 none grey, green, yellow orange
#> 7 Cordé 157 NA brown light brown
#> 8 Dormé 165 NA brown light brown
#> 9 Padmé Amidala 165 45 brown light brown
#> # ... with 7 more variables: birth_year <dbl>, gender <chr>,
#> # homeworld <chr>, species <chr>, films <list>, vehicles <list>,
#> # starships <list>
# Filter by multiple (alternative) conditions:
sw %>%
filter(homeworld == "Kashyyyk" | skin_color == "green")
#> # A tibble: 8 x 13
#> name height mass hair_color skin_color eye_color
#> <chr> <int> <dbl> <chr> <chr> <chr>
#> 1 Chewbacca 228 112 brown unknown blue
#> 2 Greedo 173 74 <NA> green black
#> 3 Yoda 66 17 white green brown
#> 4 Bossk 190 113 none green red
#> 5 Rugor Nass 206 NA none green orange
#> 6 Kit Fisto 196 87 none green black
#> 7 Poggle the Lesser 183 80 none green yellow
#> 8 Tarfful 234 136 brown brown blue
#> # ... with 7 more variables: birth_year <dbl>, gender <chr>,
#> # homeworld <chr>, species <chr>, films <list>, vehicles <list>,
#> # starships <list>
# Filter cases with missing (NA) values on specific variables:
sw %>%
filter(is.na(gender))
#> # A tibble: 3 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 C-3PO 167 75 <NA> gold yellow 112 <NA>
#> 2 R2-D2 96 32 <NA> white, blue red 33 <NA>
#> 3 R5-D4 97 32 <NA> white, red red NA <NA>
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
# Filter cases with existing (non-NA) values on specific variables:
sw %>%
filter(!is.na(mass), !is.na(birth_year))
#> # A tibble: 36 x 13
#> name height mass hair_color skin_color eye_color
#> <chr> <int> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 blond fair blue
#> 2 C-3PO 167 75 <NA> gold yellow
#> 3 R2-D2 96 32 <NA> white, blue red
#> 4 Darth Vader 202 136 none white yellow
#> 5 Leia Organa 150 49 brown light brown
#> 6 Owen Lars 178 120 brown, grey light blue
#> 7 Beru Whitesun lars 165 75 brown light blue
#> 8 Biggs Darklighter 183 84 black light brown
#> 9 Obi-Wan Kenobi 182 77 auburn, white fair blue-gray
#> 10 Anakin Skywalker 188 84 blond fair blue
#> # ... with 26 more rows, and 7 more variables: birth_year <dbl>,
#> # gender <chr>, homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
## Note: See
# ?dplyr::filter # for more help and examples.
Note some details:
- Separating multiple conditions or tests (x, y, ...) by commas is the same as the logical AND (&), as each test results in a vector of Boolean values.
- Variable names are unquoted.
- Unlike in base R, rows for which the condition evaluates to NA are dropped.
- Additional filter functions include near() for testing numerical (near-)identity.
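The last two points can be illustrated with a small sketch. The data frame d and the object names kept_dplyr and kept_base are hypothetical; a base R data frame is used to show the base R behavior of logical subsetting when a condition evaluates to NA.

```r
library(dplyr)

# Hypothetical data with one missing value:
d <- data.frame(x = c(1, NA, 3))

# dplyr::filter drops the row for which x > 1 is NA:
kept_dplyr <- filter(d, x > 1)
nrow(kept_dplyr)

# Base R subsetting keeps it as a row of NA values:
kept_base <- d[d$x > 1, , drop = FALSE]
nrow(kept_base)

# near() compares numbers with a tolerance, which helps when
# floating point arithmetic prevents exact equality:
sqrt(2)^2 == 2      # exact comparison
near(sqrt(2)^2, 2)  # comparison with tolerance
```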
Practice: Use filter on sw to select very diverse or narrow subsets of individuals. For instance,
- which individual with blond hair and blue eyes has an unknown mass?
- of which species are individuals that are over 2m tall and have brown hair?
- which individuals from Tatooine are not male (but may be NA)?
- which individuals are neither male nor female OR heavier than 130kg?
3. select to select columns
Using select selects variables (columns) by their names or numbers:
# Select 4 specific variables (columns) of sw:
select(sw, name, species, birth_year, gender)
#> # A tibble: 87 x 4
#> name species birth_year gender
#> <chr> <chr> <dbl> <chr>
#> 1 Luke Skywalker Human 19.0 male
#> 2 C-3PO Droid 112.0 <NA>
#> 3 R2-D2 Droid 33.0 <NA>
#> 4 Darth Vader Human 41.9 male
#> 5 Leia Organa Human 19.0 female
#> 6 Owen Lars Human 52.0 male
#> 7 Beru Whitesun lars Human 47.0 female
#> 8 R5-D4 Droid NA <NA>
#> 9 Biggs Darklighter Human 24.0 male
#> 10 Obi-Wan Kenobi Human 57.0 male
#> # ... with 77 more rows
# The same when using the pipe:
sw %>% # Note: %>% is NOT + (used in ggplot)
select(name, species, birth_year, gender)
#> # A tibble: 87 x 4
#> name species birth_year gender
#> <chr> <chr> <dbl> <chr>
#> 1 Luke Skywalker Human 19.0 male
#> 2 C-3PO Droid 112.0 <NA>
#> 3 R2-D2 Droid 33.0 <NA>
#> 4 Darth Vader Human 41.9 male
#> 5 Leia Organa Human 19.0 female
#> 6 Owen Lars Human 52.0 male
#> 7 Beru Whitesun lars Human 47.0 female
#> 8 R5-D4 Droid NA <NA>
#> 9 Biggs Darklighter Human 24.0 male
#> 10 Obi-Wan Kenobi Human 57.0 male
#> # ... with 77 more rows
# The same when providing a vector of variable names:
sw %>%
select(c(name, species, birth_year, gender))
#> # A tibble: 87 x 4
#> name species birth_year gender
#> <chr> <chr> <dbl> <chr>
#> 1 Luke Skywalker Human 19.0 male
#> 2 C-3PO Droid 112.0 <NA>
#> 3 R2-D2 Droid 33.0 <NA>
#> 4 Darth Vader Human 41.9 male
#> 5 Leia Organa Human 19.0 female
#> 6 Owen Lars Human 52.0 male
#> 7 Beru Whitesun lars Human 47.0 female
#> 8 R5-D4 Droid NA <NA>
#> 9 Biggs Darklighter Human 24.0 male
#> 10 Obi-Wan Kenobi Human 57.0 male
#> # ... with 77 more rows
# The same when providing column numbers:
sw %>%
select(1, 10, 7, 8)
#> # A tibble: 87 x 4
#> name species birth_year gender
#> <chr> <chr> <dbl> <chr>
#> 1 Luke Skywalker Human 19.0 male
#> 2 C-3PO Droid 112.0 <NA>
#> 3 R2-D2 Droid 33.0 <NA>
#> 4 Darth Vader Human 41.9 male
#> 5 Leia Organa Human 19.0 female
#> 6 Owen Lars Human 52.0 male
#> 7 Beru Whitesun lars Human 47.0 female
#> 8 R5-D4 Droid NA <NA>
#> 9 Biggs Darklighter Human 24.0 male
#> 10 Obi-Wan Kenobi Human 57.0 male
#> # ... with 77 more rows
# The same when providing a vector of column numbers:
sw %>%
select(c(1, 10, 7, 8))
#> # A tibble: 87 x 4
#> name species birth_year gender
#> <chr> <chr> <dbl> <chr>
#> 1 Luke Skywalker Human 19.0 male
#> 2 C-3PO Droid 112.0 <NA>
#> 3 R2-D2 Droid 33.0 <NA>
#> 4 Darth Vader Human 41.9 male
#> 5 Leia Organa Human 19.0 female
#> 6 Owen Lars Human 52.0 male
#> 7 Beru Whitesun lars Human 47.0 female
#> 8 R5-D4 Droid NA <NA>
#> 9 Biggs Darklighter Human 24.0 male
#> 10 Obi-Wan Kenobi Human 57.0 male
#> # ... with 77 more rows
# Select ranges of variables with ":":
sw %>%
select(name:mass, films:starships)
#> # A tibble: 87 x 6
#> name height mass films vehicles starships
#> <chr> <int> <dbl> <list> <list> <list>
#> 1 Luke Skywalker 172 77 <chr [5]> <chr [2]> <chr [2]>
#> 2 C-3PO 167 75 <chr [6]> <chr [0]> <chr [0]>
#> 3 R2-D2 96 32 <chr [7]> <chr [0]> <chr [0]>
#> 4 Darth Vader 202 136 <chr [4]> <chr [0]> <chr [1]>
#> 5 Leia Organa 150 49 <chr [5]> <chr [1]> <chr [0]>
#> 6 Owen Lars 178 120 <chr [3]> <chr [0]> <chr [0]>
#> 7 Beru Whitesun lars 165 75 <chr [3]> <chr [0]> <chr [0]>
#> 8 R5-D4 97 32 <chr [1]> <chr [0]> <chr [0]>
#> 9 Biggs Darklighter 183 84 <chr [1]> <chr [0]> <chr [1]>
#> 10 Obi-Wan Kenobi 182 77 <chr [6]> <chr [1]> <chr [5]>
#> # ... with 77 more rows
# Select to re-order variables (columns) with everything():
sw %>%
select(species, name, gender, everything())
#> # A tibble: 87 x 13
#> species name gender height mass hair_color
#> <chr> <chr> <chr> <int> <dbl> <chr>
#> 1 Human Luke Skywalker male 172 77 blond
#> 2 Droid C-3PO <NA> 167 75 <NA>
#> 3 Droid R2-D2 <NA> 96 32 <NA>
#> 4 Human Darth Vader male 202 136 none
#> 5 Human Leia Organa female 150 49 brown
#> 6 Human Owen Lars male 178 120 brown, grey
#> 7 Human Beru Whitesun lars female 165 75 brown
#> 8 Droid R5-D4 <NA> 97 32 <NA>
#> 9 Human Biggs Darklighter male 183 84 black
#> 10 Human Obi-Wan Kenobi male 182 77 auburn, white
#> # ... with 77 more rows, and 7 more variables: skin_color <chr>,
#> # eye_color <chr>, birth_year <dbl>, homeworld <chr>, films <list>,
#> # vehicles <list>, starships <list>
# Select variables with helper functions:
sw %>%
select(starts_with("s"))
#> # A tibble: 87 x 3
#> skin_color species starships
#> <chr> <chr> <list>
#> 1 fair Human <chr [2]>
#> 2 gold Droid <chr [0]>
#> 3 white, blue Droid <chr [0]>
#> 4 white Human <chr [1]>
#> 5 light Human <chr [0]>
#> 6 light Human <chr [0]>
#> 7 light Human <chr [0]>
#> 8 white, red Droid <chr [0]>
#> 9 light Human <chr [1]>
#> 10 fair Human <chr [5]>
#> # ... with 77 more rows
sw %>%
select(ends_with("s"))
#> # A tibble: 87 x 5
#> mass species films vehicles starships
#> <dbl> <chr> <list> <list> <list>
#> 1 77 Human <chr [5]> <chr [2]> <chr [2]>
#> 2 75 Droid <chr [6]> <chr [0]> <chr [0]>
#> 3 32 Droid <chr [7]> <chr [0]> <chr [0]>
#> 4 136 Human <chr [4]> <chr [0]> <chr [1]>
#> 5 49 Human <chr [5]> <chr [1]> <chr [0]>
#> 6 120 Human <chr [3]> <chr [0]> <chr [0]>
#> 7 75 Human <chr [3]> <chr [0]> <chr [0]>
#> 8 32 Droid <chr [1]> <chr [0]> <chr [0]>
#> 9 84 Human <chr [1]> <chr [0]> <chr [1]>
#> 10 77 Human <chr [6]> <chr [1]> <chr [5]>
#> # ... with 77 more rows
sw %>%
select(contains("_"))
#> # A tibble: 87 x 4
#> hair_color skin_color eye_color birth_year
#> <chr> <chr> <chr> <dbl>
#> 1 blond fair blue 19.0
#> 2 <NA> gold yellow 112.0
#> 3 <NA> white, blue red 33.0
#> 4 none white yellow 41.9
#> 5 brown light brown 19.0
#> 6 brown, grey light blue 52.0
#> 7 brown light blue 47.0
#> 8 <NA> white, red red NA
#> 9 black light brown 24.0
#> 10 auburn, white fair blue-gray 57.0
#> # ... with 77 more rows
sw %>%
select(matches("or"))
#> # A tibble: 87 x 4
#> hair_color skin_color eye_color homeworld
#> <chr> <chr> <chr> <chr>
#> 1 blond fair blue Tatooine
#> 2 <NA> gold yellow Tatooine
#> 3 <NA> white, blue red Naboo
#> 4 none white yellow Tatooine
#> 5 brown light brown Alderaan
#> 6 brown, grey light blue Tatooine
#> 7 brown light blue Tatooine
#> 8 <NA> white, red red Tatooine
#> 9 black light brown Tatooine
#> 10 auburn, white fair blue-gray Stewjon
#> # ... with 77 more rows
# Renaming variables:
sw %>%
rename(creature = name, from_planet = homeworld)
#> # A tibble: 87 x 13
#> creature height mass hair_color skin_color eye_color
#> <chr> <int> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 blond fair blue
#> 2 C-3PO 167 75 <NA> gold yellow
#> 3 R2-D2 96 32 <NA> white, blue red
#> 4 Darth Vader 202 136 none white yellow
#> 5 Leia Organa 150 49 brown light brown
#> 6 Owen Lars 178 120 brown, grey light blue
#> 7 Beru Whitesun lars 165 75 brown light blue
#> 8 R5-D4 97 32 <NA> white, red red
#> 9 Biggs Darklighter 183 84 black light brown
#> 10 Obi-Wan Kenobi 182 77 auburn, white fair blue-gray
#> # ... with 77 more rows, and 7 more variables: birth_year <dbl>,
#> # gender <chr>, from_planet <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
## Note: See
# ?dplyr::select     # for more help and examples.
# ?dplyr::select_if  # for more help and examples.
Note some details:
- select works both by specifying variable (column) names and by specifying column numbers.
- Variable names are unquoted.
- The sequence of variable names (separated by commas) specifies the order of columns in the resulting tibble.
- Selecting and adding everything() allows re-ordering.
- Various helper functions (e.g., starts_with, ends_with, contains, matches, num_range) refer to (parts of) variable names.
- rename renames specified variables (without quotes) and keeps all other variables.
Practice: Use select on sw to select and re-order specific subsets of variables (e.g., all variables starting with “h”, all even columns, all character variables, etc.).
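As a starting point for this practice, here is a minimal sketch on a small made-up tibble. The tibble `toy` and its values are hypothetical and only stand in for sw, so that the result does not depend on the installed version of the starwars data:

```r
library(dplyr)

# Hypothetical toy data (not part of the course materials):
toy <- tibble::tibble(name   = c("Ann", "Bob"),
                      height = c(170, 182),
                      hair   = c("brown", "black"),
                      home   = c("Konstanz", "Berlin"),
                      mass   = c(65, 80))

# All variables starting with "h" (here: height, hair, home):
toy_h <- toy %>%
  select(starts_with("h"))

# All even columns, selected by position (here: columns 2 and 4):
toy_even <- toy %>%
  select(seq(2, ncol(toy), by = 2))

# All character variables (here: name, hair, home):
toy_chr <- toy %>%
  select_if(is.character)
```

The same calls should carry over to sw by replacing `toy` with `sw`.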
4. mutate to compute new variables
mutate computes new variables (columns), either from scratch or from existing ones:
# Preparation: Save only a subset of the variables of sw as sws:
sws <- select(sw, name:mass, birth_year:species)
sws # => 87 cases (rows), but only 7 variables (columns)
#> # A tibble: 87 x 7
#> name height mass birth_year gender homeworld species
#> <chr> <int> <dbl> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 19.0 male Tatooine Human
#> 2 C-3PO 167 75 112.0 <NA> Tatooine Droid
#> 3 R2-D2 96 32 33.0 <NA> Naboo Droid
#> 4 Darth Vader 202 136 41.9 male Tatooine Human
#> 5 Leia Organa 150 49 19.0 female Alderaan Human
#> 6 Owen Lars 178 120 52.0 male Tatooine Human
#> 7 Beru Whitesun lars 165 75 47.0 female Tatooine Human
#> 8 R5-D4 97 32 NA <NA> Tatooine Droid
#> 9 Biggs Darklighter 183 84 24.0 male Tatooine Human
#> 10 Obi-Wan Kenobi 182 77 57.0 male Stewjon Human
#> # ... with 77 more rows
# Compute 2 new variables and add them to existing ones:
mutate(sws, id = 1:nrow(sw), height_feet = .032808399 * height)
#> # A tibble: 87 x 9
#> name height mass birth_year gender homeworld species
#> <chr> <int> <dbl> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 19.0 male Tatooine Human
#> 2 C-3PO 167 75 112.0 <NA> Tatooine Droid
#> 3 R2-D2 96 32 33.0 <NA> Naboo Droid
#> 4 Darth Vader 202 136 41.9 male Tatooine Human
#> 5 Leia Organa 150 49 19.0 female Alderaan Human
#> 6 Owen Lars 178 120 52.0 male Tatooine Human
#> 7 Beru Whitesun lars 165 75 47.0 female Tatooine Human
#> 8 R5-D4 97 32 NA <NA> Tatooine Droid
#> 9 Biggs Darklighter 183 84 24.0 male Tatooine Human
#> 10 Obi-Wan Kenobi 182 77 57.0 male Stewjon Human
#> # ... with 77 more rows, and 2 more variables: id <int>, height_feet <dbl>
# The same using the pipe:
sws %>%
mutate(id = 1:nrow(sw), height_feet = .032808399 * height)
#> # A tibble: 87 x 9
#> name height mass birth_year gender homeworld species
#> <chr> <int> <dbl> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 19.0 male Tatooine Human
#> 2 C-3PO 167 75 112.0 <NA> Tatooine Droid
#> 3 R2-D2 96 32 33.0 <NA> Naboo Droid
#> 4 Darth Vader 202 136 41.9 male Tatooine Human
#> 5 Leia Organa 150 49 19.0 female Alderaan Human
#> 6 Owen Lars 178 120 52.0 male Tatooine Human
#> 7 Beru Whitesun lars 165 75 47.0 female Tatooine Human
#> 8 R5-D4 97 32 NA <NA> Tatooine Droid
#> 9 Biggs Darklighter 183 84 24.0 male Tatooine Human
#> 10 Obi-Wan Kenobi 182 77 57.0 male Stewjon Human
#> # ... with 77 more rows, and 2 more variables: id <int>, height_feet <dbl>
# transmute computes new variables and keeps only these:
sws %>%
transmute(id = 1:nrow(sw), height_feet = .032808399 * height)
#> # A tibble: 87 x 2
#> id height_feet
#> <int> <dbl>
#> 1 1 5.643045
#> 2 2 5.479003
#> 3 3 3.149606
#> 4 4 6.627297
#> 5 5 4.921260
#> 6 6 5.839895
#> 7 7 5.413386
#> 8 8 3.182415
#> 9 9 6.003937
#> 10 10 5.971129
#> # ... with 77 more rows
# Compute variables based on multiple others (including computed ones):
sws %>%
mutate(BMI = mass / ((height / 100) ^ 2), # compute body mass index (kg/m^2)
BMI_low = BMI < 18.5, # classify low BMI values
BMI_high = BMI > 30, # classify high BMI values
BMI_norm = !BMI_low & !BMI_high # classify normal BMI values
)
#> # A tibble: 87 x 11
#> name height mass birth_year gender homeworld species
#> <chr> <int> <dbl> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 19.0 male Tatooine Human
#> 2 C-3PO 167 75 112.0 <NA> Tatooine Droid
#> 3 R2-D2 96 32 33.0 <NA> Naboo Droid
#> 4 Darth Vader 202 136 41.9 male Tatooine Human
#> 5 Leia Organa 150 49 19.0 female Alderaan Human
#> 6 Owen Lars 178 120 52.0 male Tatooine Human
#> 7 Beru Whitesun lars 165 75 47.0 female Tatooine Human
#> 8 R5-D4 97 32 NA <NA> Tatooine Droid
#> 9 Biggs Darklighter 183 84 24.0 male Tatooine Human
#> 10 Obi-Wan Kenobi 182 77 57.0 male Stewjon Human
#> # ... with 77 more rows, and 4 more variables: BMI <dbl>, BMI_low <lgl>,
#> # BMI_high <lgl>, BMI_norm <lgl>
## Note: See
# ?dplyr::mutate # for more help and examples.
Note some details:
- mutate computes new variables (columns) and adds them to existing ones, while transmute drops existing ones.
- Each mutate command specifies a new variable name (without quotes), followed by = and a rule for computing the new variable from existing ones.
- Variable names are unquoted.
- Multiple mutate steps are separated by commas, each of which creates a new variable.
- See http://r4ds.had.co.nz/transform.html#mutate-funs for useful functions for creating new variables.
Practice: Compute a new variable mass_pound from mass (in kg) and the age of each individual in sw relative to Yoda’s age. (Note that the variable birth_year is provided in years BBY, i.e., Before Battle of Yavin.)
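One possible way to approach this practice is sketched below on a three-row tibble. The tibble `toy` and the helper column `yoda_by` are hypothetical. The values are copied from the sw output shown in this file, and the conversion factor (1 kg is roughly 2.20462 lb) is an added assumption:

```r
library(dplyr)

# Hypothetical toy data with values taken from the sw output in this file:
toy <- tibble::tibble(name       = c("Luke Skywalker", "Yoda", "Leia Organa"),
                      mass       = c(77, 17, 49),
                      birth_year = c(19, 896, 19))

toy2 <- toy %>%
  mutate(mass_pound  = mass * 2.20462,             # kg to pounds (approximate factor)
         yoda_by     = birth_year[name == "Yoda"], # Yoda's birth year (in BBY), recycled to all rows
         age_vs_yoda = birth_year - yoda_by)       # negative values: born later (younger) than Yoda

toy2
```

Because birth_year counts years before the Battle of Yavin, larger values denote older individuals, so the difference is negative for everyone younger than Yoda.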
5. summarise to compute summaries
summarise computes a function for a specified variable and collapses the values of the specified variable (i.e., the rows of a specified column) to a single value. It provides many different summary statistics by itself, but is even more useful in combination with group_by (discussed next).
# Summarise allows computing a function for a variable (column):
summarise(sw, mn_mass = mean(mass, na.rm = TRUE)) # => 97.31 kg
#> # A tibble: 1 x 1
#> mn_mass
#> <dbl>
#> 1 97.31186
# The same using the pipe:
sw %>%
summarise(mn_mass = mean(mass, na.rm = TRUE)) # => 97.31 kg
#> # A tibble: 1 x 1
#> mn_mass
#> <dbl>
#> 1 97.31186
# Multiple summarise steps allow applying
# different functions for 1 dependent variable:
sw %>%
summarise(n_mass = sum(!is.na(mass)),
mn_mass = mean(mass, na.rm = TRUE),
md_mass = median(mass, na.rm = TRUE),
sd_mass = sd(mass, na.rm = TRUE),
max_mass = max(mass, na.rm = TRUE),
big_mass = any(mass > 1000)
)
#> # A tibble: 1 x 6
#> n_mass mn_mass md_mass sd_mass max_mass big_mass
#> <int> <dbl> <dbl> <dbl> <dbl> <lgl>
#> 1 59 97.31186 79 169.4572 1358 TRUE
# Multiple summarise steps also allow applying
# different functions to different dependent variables:
sw %>%
summarise(# Descriptives of height:
n_height = sum(!is.na(height)),
mn_height = mean(height, na.rm = TRUE),
sd_height = sd(height, na.rm = TRUE),
# Descriptives of mass:
n_mass = sum(!is.na(mass)),
mn_mass = mean(mass, na.rm = TRUE),
sd_mass = sd(mass, na.rm = TRUE),
# Counts of character variables:
n_names = n(),
n_species = n_distinct(species),
n_worlds = n_distinct(homeworld)
)
#> # A tibble: 1 x 9
#> n_height mn_height sd_height n_mass mn_mass sd_mass n_names n_species
#> <int> <dbl> <dbl> <int> <dbl> <dbl> <int> <int>
#> 1 81 174.358 34.77043 59 97.31186 169.4572 87 38
#> # ... with 1 more variables: n_worlds <int>
## Note: See
# ?dplyr::summarise # for more help and examples.
Note some details:
- summarise collapses multiple values into one value and returns a new tibble with one row per group (or a single row for ungrouped data) and one column per value computed (plus any grouping variables).
- Each summarise step specifies a new variable name (without quotes), followed by =, and a function for computing the new variable from existing ones.
- Multiple summarise steps are separated by commas.
- Variable names are unquoted.
- See https://dplyr.tidyverse.org/reference/summarise.html for examples and useful functions in combination with summarise.
Practice: Apply all summary functions mentioned in ?dplyr::summarise to the sw dataset.
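Before doing so, it may help to see why most summaries above use na.rm = TRUE. The following sketch uses a hypothetical four-row tibble `toy` with one missing value. It is an illustration under these assumptions, not part of the original examples:

```r
library(dplyr)

# Hypothetical toy data with one missing value:
toy <- tibble::tibble(mass = c(77, 75, NA, 136))

res <- toy %>%
  summarise(n_all    = n(),                       # number of rows (including NA)
            n_known  = sum(!is.na(mass)),         # number of non-missing values
            mn_naive = mean(mass),                # NA, as one value is missing
            mn_mass  = mean(mass, na.rm = TRUE))  # mean of the 3 known values

res  # a tibble with 1 row and 4 columns
```

Counting the non-missing values alongside each statistic (as in n_known) makes it clear how many cases a summary is actually based on.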
6. group_by to aggregate variables
Using group_by does not change the data, but the unit of aggregation for other commands, which is very useful in combination with mutate and summarise.
# Grouping does not change the data, but lists its groups:
group_by(sws, species) # => 38 groups of species
#> # A tibble: 87 x 7
#> # Groups: species [38]
#> name height mass birth_year gender homeworld species
#> <chr> <int> <dbl> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 19.0 male Tatooine Human
#> 2 C-3PO 167 75 112.0 <NA> Tatooine Droid
#> 3 R2-D2 96 32 33.0 <NA> Naboo Droid
#> 4 Darth Vader 202 136 41.9 male Tatooine Human
#> 5 Leia Organa 150 49 19.0 female Alderaan Human
#> 6 Owen Lars 178 120 52.0 male Tatooine Human
#> 7 Beru Whitesun lars 165 75 47.0 female Tatooine Human
#> 8 R5-D4 97 32 NA <NA> Tatooine Droid
#> 9 Biggs Darklighter 183 84 24.0 male Tatooine Human
#> 10 Obi-Wan Kenobi 182 77 57.0 male Stewjon Human
#> # ... with 77 more rows
# The same using the pipe:
sws %>%
group_by(species) # => 38 groups of species
#> # A tibble: 87 x 7
#> # Groups: species [38]
#> name height mass birth_year gender homeworld species
#> <chr> <int> <dbl> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 19.0 male Tatooine Human
#> 2 C-3PO 167 75 112.0 <NA> Tatooine Droid
#> 3 R2-D2 96 32 33.0 <NA> Naboo Droid
#> 4 Darth Vader 202 136 41.9 male Tatooine Human
#> 5 Leia Organa 150 49 19.0 female Alderaan Human
#> 6 Owen Lars 178 120 52.0 male Tatooine Human
#> 7 Beru Whitesun lars 165 75 47.0 female Tatooine Human
#> 8 R5-D4 97 32 NA <NA> Tatooine Droid
#> 9 Biggs Darklighter 183 84 24.0 male Tatooine Human
#> 10 Obi-Wan Kenobi 182 77 57.0 male Stewjon Human
#> # ... with 77 more rows
# group_by is ineffective by itself, but very powerful
# (a) in combination with `mutate` and
# (b) in combination with `summarise`.
# ad (a):
# In combination with mutate and an aggregation function,
# group_by changes the unit of aggregation:
sws %>%
mutate(mn_height_1 = mean(height, na.rm = TRUE)) %>% # aggregates over ALL cases
group_by(species) %>%
mutate(mn_height_2 = mean(height, na.rm = TRUE)) %>% # aggregates over current group (species)
group_by(gender) %>%
mutate(mn_height_3 = mean(height, na.rm = TRUE)) %>% # aggregates over current group (gender)
group_by(name) %>%
mutate(mn_height_4 = mean(height, na.rm = TRUE)) # aggregates over current group (name)
#> # A tibble: 87 x 11
#> # Groups: name [87]
#> name height mass birth_year gender homeworld species
#> <chr> <int> <dbl> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 19.0 male Tatooine Human
#> 2 C-3PO 167 75 112.0 <NA> Tatooine Droid
#> 3 R2-D2 96 32 33.0 <NA> Naboo Droid
#> 4 Darth Vader 202 136 41.9 male Tatooine Human
#> 5 Leia Organa 150 49 19.0 female Alderaan Human
#> 6 Owen Lars 178 120 52.0 male Tatooine Human
#> 7 Beru Whitesun lars 165 75 47.0 female Tatooine Human
#> 8 R5-D4 97 32 NA <NA> Tatooine Droid
#> 9 Biggs Darklighter 183 84 24.0 male Tatooine Human
#> 10 Obi-Wan Kenobi 182 77 57.0 male Stewjon Human
#> # ... with 77 more rows, and 4 more variables: mn_height_1 <dbl>,
#> # mn_height_2 <dbl>, mn_height_3 <dbl>, mn_height_4 <dbl>
# ad (b):
# group_by is particularly useful in combination
# with summarise:
sws %>%
group_by(homeworld) %>%
summarise(count = n(),
mn_height = mean(height, na.rm = TRUE),
mn_mass = mean(mass, na.rm = TRUE)
)
#> # A tibble: 49 x 4
#> homeworld count mn_height mn_mass
#> <chr> <int> <dbl> <dbl>
#> 1 Alderaan 3 176.3333 64.0
#> 2 Aleen Minor 1 79.0000 15.0
#> 3 Bespin 1 175.0000 79.0
#> 4 Bestine IV 1 180.0000 110.0
#> 5 Cato Neimoidia 1 191.0000 90.0
#> 6 Cerea 1 198.0000 82.0
#> 7 Champala 1 196.0000 NaN
#> 8 Chandrila 1 150.0000 NaN
#> 9 Concord Dawn 1 183.0000 79.0
#> 10 Corellia 2 175.0000 78.5
#> # ... with 39 more rows
# Note that this pipe returns a new tibble,
# with 49 rows (= different levels of homeworld) and
# - 1 column of the group variable (homeworld) and
# - 3 columns of the 3 newly summarised variables.
# group_by used with multiple variables yields a tibble
# containing the combination of all variable levels:
sw %>%
group_by(hair_color, eye_color) # => 35 groups (combinations)
#> # A tibble: 87 x 13
#> # Groups: hair_color, eye_color [35]
#> name height mass hair_color skin_color eye_color
#> <chr> <int> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywalker 172 77 blond fair blue
#> 2 C-3PO 167 75 <NA> gold yellow
#> 3 R2-D2 96 32 <NA> white, blue red
#> 4 Darth Vader 202 136 none white yellow
#> 5 Leia Organa 150 49 brown light brown
#> 6 Owen Lars 178 120 brown, grey light blue
#> 7 Beru Whitesun lars 165 75 brown light blue
#> 8 R5-D4 97 32 <NA> white, red red
#> 9 Biggs Darklighter 183 84 black light brown
#> 10 Obi-Wan Kenobi 182 77 auburn, white fair blue-gray
#> # ... with 77 more rows, and 7 more variables: birth_year <dbl>,
#> # gender <chr>, homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
# Counting the frequency of cases in groups:
sw %>%
group_by(hair_color, eye_color) %>%
count() %>%
arrange(desc(n))
#> # A tibble: 35 x 3
#> # Groups: hair_color, eye_color [35]
#> hair_color eye_color n
#> <chr> <chr> <int>
#> 1 black brown 9
#> 2 brown brown 9
#> 3 none black 9
#> 4 brown blue 7
#> 5 none orange 7
#> 6 none yellow 6
#> 7 blond blue 3
#> 8 none blue 3
#> 9 none red 3
#> 10 black blue 2
#> # ... with 25 more rows
# The same using summarise:
sw %>%
group_by(hair_color, eye_color) %>%
summarise(n = n()) %>%
arrange(desc(n))
#> # A tibble: 35 x 3
#> # Groups: hair_color [13]
#> hair_color eye_color n
#> <chr> <chr> <int>
#> 1 black brown 9
#> 2 brown brown 9
#> 3 none black 9
#> 4 brown blue 7
#> 5 none orange 7
#> 6 none yellow 6
#> 7 blond blue 3
#> 8 none blue 3
#> 9 none red 3
#> 10 black blue 2
#> # ... with 25 more rows
## Note: See
# ?dplyr::group_by # for more help and examples.
Note some details:
- group_by changes the unit of aggregation for other commands (mutate and summarise).
- Variable names are unquoted.
- When using group_by with multiple variables, they are separated by commas.
- Using group_by with mutate results in a tibble that has the same number of cases (rows) as the original tibble. By contrast, using group_by with summarise results in a new tibble with all combinations of variable levels as its cases (rows).
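The difference between grouped mutate and grouped summarise can be checked on a small example. The tibble `toy` below is hypothetical (its heights are copied from the sw output above), and the sketch only illustrates the resulting numbers of rows:

```r
library(dplyr)

# Hypothetical toy data (heights taken from the sw output above):
toy <- tibble::tibble(species = c("Human", "Human", "Droid", "Droid", "Droid"),
                      height  = c(172, 150, 167, 96, 97))

# Grouped mutate: keeps all 5 rows and adds the group mean to each row:
by_mutate <- toy %>%
  group_by(species) %>%
  mutate(mn_height = mean(height)) %>%
  ungroup()

# Grouped summarise: collapses the data to 1 row per group (2 rows):
by_summarise <- toy %>%
  group_by(species) %>%
  summarise(mn_height = mean(height))

nrow(by_mutate)     # same number of rows as toy
nrow(by_summarise)  # number of groups
```

Calling ungroup() after a grouped mutate is optional here, but avoids carrying the grouping into later steps by accident.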
Practice: Create some groups and compute descriptive statistics (n, mean, median, standard deviation, …) for some variables. For instance,
What is the number and mean height and mass of individuals from Tatooine by species and gender?
Which humans are more than 5 cm taller than the average human overall?
Which humans are more than 5 cm taller than the average human of their own gender?
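The last two questions differ only in the unit of aggregation. A minimal sketch, assuming a hypothetical tibble `toy` with made-up names and heights (not the sw data), might look as follows:

```r
library(dplyr)

# Hypothetical toy data (made-up individuals):
toy <- tibble::tibble(name   = c("A", "B", "C", "D", "E", "F"),
                      gender = c("female", "female", "female", "male", "male", "male"),
                      height = c(150, 165, 168, 170, 183, 202))

# More than 5 cm taller than the overall average:
tall_overall <- toy %>%
  mutate(mn_height = mean(height, na.rm = TRUE)) %>%  # mean over ALL cases
  filter(height > mn_height + 5)

# More than 5 cm taller than the average of their own gender:
tall_by_gender <- toy %>%
  group_by(gender) %>%
  mutate(mn_height = mean(height, na.rm = TRUE)) %>%  # mean over current group
  ungroup() %>%
  filter(height > mn_height + 5)
```

For sw, the same pattern would be preceded by a filter on species to keep only humans.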
Combining commands
The essential dplyr commands are quite simple by themselves, but form the basic verbs of a language for data manipulation. The commands become particularly powerful when they are combined into pipes (by using %>%). Stringing together several dplyr commands allows slicing and dicing data (tibbles or data frames) in a step-wise fashion to run non-trivial data analyses on the fly.
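To illustrate such a pipe, the following sketch chains filter, group_by, summarise, and arrange. The tibble `toy` and its values are hypothetical and only serve to keep the example self-contained:

```r
library(dplyr)

# Hypothetical toy data:
toy <- tibble::tibble(
  species   = c("Human", "Human", "Human", "Droid", "Droid", "Gungan"),
  homeworld = c("Tatooine", "Tatooine", "Alderaan", "Tatooine", "Naboo", "Naboo"),
  mass      = c(77, 120, 49, 75, 32, NA)
)

res <- toy %>%
  filter(!is.na(mass)) %>%       # 1. keep only cases with known mass
  group_by(homeworld) %>%        # 2. change the unit of aggregation
  summarise(n = n(),             # 3. compute counts and means per homeworld
            mn_mass = mean(mass)) %>%
  arrange(desc(mn_mass))         # 4. sort by mean mass (heaviest first)

res
```

Each step receives the tibble returned by the previous step, so the pipe can be built up and inspected one line at a time.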
Practice: Tidyverse meets universe
Answer the following questions about the dplyr::starwars dataset by using pipes of essential dplyr commands:
a. Basics:
- Save the tibble dplyr::starwars as sw and report its dimensions.
b. Missing values and known unknowns:
- How many missing (NA) values does sw contain?
- Which individuals come from an unknown (missing) homeworld but have a known birth_year or known mass?
c. Gender issues:
- How many humans are contained in sw overall and by gender?
- How many and which individuals in sw are neither male nor female?
- Of which species in sw exist at least 2 different gender values?
d. Popular homes and heights:
- From which homeworld do the most individuals (rows) come?
- What is the mean height of all individuals with orange eyes from the most popular homeworld?
e. Size and mass issues:
- Compute the median, mean, and standard deviation of height for all droids.
- Compute the average height and mass by species and save the result as h_m.
- Sort h_m to list the 3 species with the smallest individuals (in terms of mean height).
- Sort h_m to list the 3 species with the heaviest individuals (in terms of median mass).
f. Counting and arranging:
- How many individuals exist of the three most frequent (known) species?
g. Grouped mutates:
- Which individuals are more than 20% lighter than the average mass of individuals of their own homeworld?
# library(tidyverse)
# ?dplyr::starwars
## (a) Basic data properties: ----
sw <- dplyr::starwars
dim(sw) # => 87 rows (denoting individuals) x 13 columns (variables)
#> [1] 87 13
## (b) Missing data: -----
## (+) How many missing data points?
sum(is.na(sw)) # => 101 missing values.
#> [1] 101
# (+) Which individuals come from an unknown (missing) homeworld
# but have a known birth_year or mass?
sw %>%
filter(is.na(homeworld), !is.na(mass) | !is.na(birth_year))
#> # A tibble: 3 x 13
#> name height mass hair_color skin_color eye_color birth_year
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl>
#> 1 Yoda 66 17 white green brown 896
#> 2 IG-88 200 140 none metal red 15
#> 3 Qui-Gon Jinn 193 89 brown fair blue 92
#> # ... with 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#> # films <list>, vehicles <list>, starships <list>
## (x) Which variable (column) has the most missing values?
colSums(is.na(sw)) # => birth_year has 44 missing values
#> name height mass hair_color skin_color eye_color
#> 0 6 28 5 0 0
#> birth_year gender homeworld species films vehicles
#> 44 3 10 5 0 0
#> starships
#> 0
colMeans(is.na(sw)) # (amounting to 50.6% of all cases).
#> name height mass hair_color skin_color eye_color
#> 0.00000000 0.06896552 0.32183908 0.05747126 0.00000000 0.00000000
#> birth_year gender homeworld species films vehicles
#> 0.50574713 0.03448276 0.11494253 0.05747126 0.00000000 0.00000000
#> starships
#> 0.00000000
## (x) Replace all missing values of `hair_color` (in the variable `sw$hair_color`) by "bald":
# sw$hair_color[is.na(sw$hair_color)] <- "bald"
## (c) Gender issues: -----
# (+) How many humans are there of each gender?
sw %>%
filter(species == "Human") %>%
group_by(gender) %>%
count()
#> # A tibble: 2 x 2
#> # Groups: gender [2]
#> gender n
#> <chr> <int>
#> 1 female 9
#> 2 male 26
## Answer: 35 humans in total: 9 female, 26 male.
# (+) How many and which individuals are neither male nor female?
sw %>%
filter(gender != "male", gender != "female")
#> # A tibble: 3 x 13
#> name height mass hair_color skin_color eye_color
#> <chr> <int> <dbl> <chr> <chr> <chr>
#> 1 Jabba Desilijic Tiure 175 1358 <NA> green-tan, brown orange
#> 2 IG-88 200 140 none metal red
#> 3 BB8 NA NA none none black
#> # ... with 7 more variables: birth_year <dbl>, gender <chr>,
#> # homeworld <chr>, species <chr>, films <list>, vehicles <list>,
#> # starships <list>
# (+) Of which species are there at least 2 different gender values?
sw %>%
group_by(species, gender) %>%
count() %>% # table shows species by gender:
group_by(species) %>% # Which species appear more than once in this table?
count() %>%
filter(nn > 1)
#> # A tibble: 5 x 2
#> # Groups: species [5]
#> species nn
#> <chr> <int>
#> 1 Droid 2
#> 2 Human 2
#> 3 Kaminoan 2
#> 4 Twi'lek 2
#> 5 <NA> 2
## (d) Homeworld issues: -----
# (+) Popular homes: From which homeworld do the most individuals (rows) come?
sw %>%
group_by(homeworld) %>%
count() %>%
arrange(desc(n))
#> # A tibble: 49 x 2
#> # Groups: homeworld [49]
#> homeworld n
#> <chr> <int>
#> 1 Naboo 11
#> 2 Tatooine 10
#> 3 <NA> 10
#> 4 Alderaan 3
#> 5 Coruscant 3
#> 6 Kamino 3
#> 7 Corellia 2
#> 8 Kashyyyk 2
#> 9 Mirial 2
#> 10 Ryloth 2
#> # ... with 39 more rows
# => Naboo (with 11 individuals)
# (+) What is the mean height of all individuals with orange eyes from the most popular homeworld?
sw %>%
filter(homeworld == "Naboo", eye_color == "orange") %>%
summarise(n = n(),
mn_height = mean(height))
#> # A tibble: 1 x 2
#> n mn_height
#> <int> <dbl>
#> 1 3 208.6667
## Note:
sw %>% filter(eye_color == "orange") # => 8 individuals
#> # A tibble: 8 x 13
#> name height mass hair_color skin_color
#> <chr> <int> <dbl> <chr> <chr>
#> 1 Jabba Desilijic Tiure 175 1358 <NA> green-tan, brown
#> 2 Ackbar 180 83 none brown mottle
#> 3 Jar Jar Binks 196 66 none orange
#> 4 Roos Tarpals 224 82 none grey
#> 5 Rugor Nass 206 NA none green
#> 6 Sebulba 112 40 none grey, red
#> 7 Ben Quadinaros 163 65 none grey, green, yellow
#> 8 Saesee Tiin 188 NA none pale
#> # ... with 8 more variables: eye_color <chr>, birth_year <dbl>,
#> # gender <chr>, homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
# (+) What is the mass and homeworld of the smallest droid?
sw %>%
filter(species == "Droid") %>%
arrange(height)
#> # A tibble: 5 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 R2-D2 96 32 <NA> white, blue red 33 <NA>
#> 2 R5-D4 97 32 <NA> white, red red NA <NA>
#> 3 C-3PO 167 75 <NA> gold yellow 112 <NA>
#> 4 IG-88 200 140 none metal red 15 none
#> 5 BB8 NA NA none none black NA none
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
## (e) Size and mass: Group summaries: -----
# (+) Compute the median, mean, and standard deviation of `height` for all droids.
sw %>%
filter(species == "Droid") %>%
summarise(n = n(),
not_NA_h = sum(!is.na(height)),
md_height = median(height, na.rm = TRUE),
mn_height = mean(height, na.rm = TRUE),
sd_height = sd(height, na.rm = TRUE))
#> # A tibble: 1 x 5
#> n not_NA_h md_height mn_height sd_height
#> <int> <int> <dbl> <dbl> <dbl>
#> 1 5 4 132 140 52.00641
# (+) Compute the average height and mass by species and save the result as `h_m`:
h_m <- sw %>%
group_by(species) %>%
summarise(n = n(),
not_NA_h = sum(!is.na(height)),
mn_height = mean(height, na.rm = TRUE),
not_NA_m = sum(!is.na(mass)),
md_mass = median(mass, na.rm = TRUE)
)
h_m
#> # A tibble: 38 x 6
#> species n not_NA_h mn_height not_NA_m md_mass
#> <chr> <int> <int> <dbl> <int> <dbl>
#> 1 Aleena 1 1 79.0000 1 15.0
#> 2 Besalisk 1 1 198.0000 1 102.0
#> 3 Cerean 1 1 198.0000 1 82.0
#> 4 Chagrian 1 1 196.0000 0 NA
#> 5 Clawdite 1 1 168.0000 1 55.0
#> 6 Droid 5 4 140.0000 4 53.5
#> 7 Dug 1 1 112.0000 1 40.0
#> 8 Ewok 1 1 88.0000 1 20.0
#> 9 Geonosian 1 1 183.0000 1 80.0
#> 10 Gungan 3 3 208.6667 2 74.0
#> # ... with 28 more rows
# (+) Use `h_m` to list the 3 species with the smallest individuals (in terms of mean height):
h_m %>% arrange(mn_height) %>% slice(1:3)
#> # A tibble: 3 x 6
#> species n not_NA_h mn_height not_NA_m md_mass
#> <chr> <int> <int> <dbl> <int> <dbl>
#> 1 Yoda's species 1 1 66 1 17
#> 2 Aleena 1 1 79 1 15
#> 3 Ewok 1 1 88 1 20
# (+) Use `h_m` to list the 3 species with the heaviest individuals (in terms of median mass):
h_m %>% arrange(desc(md_mass)) %>% slice(1:3)
#> # A tibble: 3 x 6
#> species n not_NA_h mn_height not_NA_m md_mass
#> <chr> <int> <int> <dbl> <int> <dbl>
#> 1 Hutt 1 1 175 1 1358
#> 2 Kaleesh 1 1 216 1 159
#> 3 Wookiee 2 2 231 2 124
## (+) Other questions: -----
# (f) How many individuals come from the 3 most frequent (known) species?
sw %>%
group_by(species) %>%
count %>%
arrange(desc(n)) %>%
filter(n > 1)
#> # A tibble: 9 x 2
#> # Groups: species [9]
#> species n
#> <chr> <int>
#> 1 Human 35
#> 2 Droid 5
#> 3 <NA> 5
#> 4 Gungan 3
#> 5 Kaminoan 2
#> 6 Mirialan 2
#> 7 Twi'lek 2
#> 8 Wookiee 2
#> 9 Zabrak 2
# (g) Which individuals are more than 20% lighter (in terms of mass)
# than the average mass of individuals of their own homeworld?
sw %>%
select(name, homeworld, mass) %>%
group_by(homeworld) %>%
mutate(n_notNA_mass = sum(!is.na(mass)),
mn_mass = mean(mass, na.rm = TRUE),
lighter = mass < (mn_mass - (.20 * mn_mass))
) %>%
filter(lighter == TRUE)
#> # A tibble: 5 x 6
#> # Groups: homeworld [4]
#> name homeworld mass n_notNA_mass mn_mass lighter
#> <chr> <chr> <dbl> <int> <dbl> <lgl>
#> 1 R2-D2 Naboo 32 6 64.16667 TRUE
#> 2 Leia Organa Alderaan 49 2 64.00000 TRUE
#> 3 R5-D4 Tatooine 32 8 85.37500 TRUE
#> 4 Yoda <NA> 17 3 82.00000 TRUE
#> 5 Padmé Amidala Naboo 45 6 64.16667 TRUE
More on data transformation
For more details on dplyr,
- study vignette("dplyr") and the documentation for ?arrange, ?filter, ?select, etc.;
- study https://dplyr.tidyverse.org/ and its examples;
- see the cheat sheet on data transformation;
- read Chapter 5: Data transformation and complete its exercises.
Visualizing data
In the following, we introduce some essential commands of ggplot2 in the context of examples. However, the ggplot2 package extends far beyond this modest introduction – it is an important pillar (and predecessor) of the tidyverse and implements a language for and philosophy of data visualisation.
See Chapter 3: Data visualization and Chapter 7: Exploratory data analysis (EDA), as well as the links provided below, for more detailed information.
Commands and examples
General structure of ggplot calls
A generic template for creating a graph with ggplot is:
# Generic ggplot template:
ggplot(data = <DATA>) +
<GEOM_fun>(mapping = aes(<MAPPING>), <arg_1 = val_1, ..., arg_n = val_n>) +
<FACET_fun> + # optional
<LOOK_GOOD_fun> # optional
# Minimal ggplot template:
ggplot(<DATA>) +
<GEOM_fun>(aes(<MAPPING>))
The generic template includes the following parts:
- <DATA> is a data frame or tibble that contains the data that is to be plotted.
- <GEOM_fun> is a function that maps data to a geometric object (“geom”) according to an aesthetic mapping that is specified in aes(<MAPPING>). (A “mapping” specifies what goes where.)
- A geom’s visual appearance (e.g., colors, shapes, sizes, …) can be customized
  - in the aesthetic mapping (when varying visual features according to data properties), or
  - by setting its arguments to specific values in <arg_1 = val_1, ..., arg_n = val_n> (when remaining constant).
- An optional <FACET_fun> splits a complex plot into multiple subplots.
- A sequence of optional <LOOK_GOOD_fun> adjusts the visual features of plots (e.g., by adding themes, plot titles and labels, color scales, and coordinate systems).
Some examples that illustrate the use of these components are:
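As a first sketch, the generic template could be filled in as follows. The choice of variables (displ, hwy, drv, year from ggplot2::mpg) and all labels are assumptions made for illustration only:

```r
library(ggplot2)

p <- ggplot(data = mpg) +                                      # <DATA>
  geom_point(mapping = aes(x = displ, y = hwy, color = drv),   # <GEOM_fun> with <MAPPING>
             alpha = 1/2, size = 2) +                          # constant arguments
  facet_wrap(~year) +                                          # <FACET_fun> (optional)
  labs(title = "Highway fuel economy by engine displacement",  # <LOOK_GOOD_fun> (optional)
       x = "displ (litres)",
       y = "hwy (miles per gallon)") +
  theme_bw()
p
```

Here, color varies with the data (so it is set inside aes), whereas alpha and size are constant (so they are set as arguments of the geom).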
A histogram
A histogram counts how often specific values of one (typically continuous) variable occur in the data. This allows viewing the distribution of values for this variable:
library(ggplot2)
# Data: ------
# Using mpg data:
?ggplot2::mpg
mpg
#> # A tibble: 234 x 11
#> manufacturer model displ year cyl trans drv cty hwy fl
#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr>
#> 1 audi a4 1.80 1999 4 auto(l… f 18 29 p
#> 2 audi a4 1.80 1999 4 manual… f 21 29 p
#> 3 audi a4 2.00 2008 4 manual… f 20 31 p
#> 4 audi a4 2.00 2008 4 auto(a… f 21 30 p
#> 5 audi a4 2.80 1999 6 auto(l… f 16 26 p
#> 6 audi a4 2.80 1999 6 manual… f 18 26 p
#> 7 audi a4 3.10 2008 6 auto(a… f 18 27 p
#> 8 audi a4 quat… 1.80 1999 4 manual… 4 18 26 p
#> 9 audi a4 quat… 1.80 1999 4 auto(l… 4 16 25 p
#> 10 audi a4 quat… 2.00 2008 4 manual… 4 20 28 p
#> # ... with 224 more rows, and 1 more variable: class <chr>
# (A) Histogram: ------
# A minimal histogram:
hi1 <- ggplot(mpg, aes(x = cty)) + # set mappings for ALL geoms
geom_histogram(binwidth = 1)
hi1
# The same histogram:
hi1b <- ggplot(mpg) +
geom_histogram(aes(x = cty), binwidth = 1) # set mappings for THIS geom
hi1b
# (B) Adding aesthetics, labels and themes: ------
# Enhanced version of the same plot:
hi2 <- ggplot(mpg) +
geom_histogram(aes(x = cty), binwidth = 1, fill = "forestgreen", color = "black") +
labs(title = "Distribution of fuel economy in city environments",
x = "cty (miles per gallon)",
caption = "Data from ggplot2::mpg") +
theme_light()
hi2
A scatterplot
A scatterplot shows a data point (observation) as a function of 2 (typically continuous) variables x and y. This allows judging the relationship between x and y in the data:
# (A) Scatterplot: ------
# A minimal scatterplot + reference line:
sp1 <- ggplot(mpg) +
geom_point(aes(x = cty, y = hwy)) +
geom_abline()
sp1
Dealing with overplotting
A common issue with scatterplots is so-called overplotting: Multiple points are plotted at the same position.
Here are some ways of dealing with this issue:
- jitter adds randomness to positions;
- alpha uses transparency to show frequency of positions;
- the size aesthetic (or geom_count) allows mapping values (e.g., frequency) to object size;
- facet_wrap allows disentangling plots by levels of variables.
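To get a feeling for how much overplotting a scatterplot of cty and hwy involves, the number of observations can be compared with the number of distinct positions. The following is a rough sketch. The object names `pos` and `sp_count` are made up for illustration:

```r
library(dplyr)
library(ggplot2)

# Count how many observations share each (cty, hwy) position:
pos <- mpg %>%
  count(cty, hwy)

nrow(mpg)  # number of observations
nrow(pos)  # number of distinct positions (fewer, i.e., some points overlap)

# geom_count() performs a similar count and maps it to point size:
sp_count <- ggplot(mpg) +
  geom_count(aes(x = cty, y = hwy), alpha = 1/2) +
  geom_abline(linetype = 2)
sp_count
```

Mapping frequency to size keeps every point at its exact position, whereas jittering trades positional accuracy for visibility.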
Some examples include:
## Dealing with overplotting: -----
# 1. One way of dealing with overplotting is
# adding randomness to point positions:
sp2 <- ggplot(mpg) +
geom_point(aes(x = cty, y = hwy), position = "jitter") +
geom_abline()
sp2
# 2. Another way of dealing with overplotting is
# using transparency (via setting alpha to < 1):
sp3 <- ggplot(mpg) +
geom_point(aes(x = cty, y = hwy), position = "identity",
pch = 21, fill = "steelblue", alpha = 1/4, size = 4) +
geom_abline(linetype = 2, color = "firebrick") # +
# geom_rug(aes(x = cty, y = hwy), position = "jitter", alpha = 1/4, size = 1)
sp3
# Adding labels and themes to plots:
sp4 <- sp3 + # use the plot defined above
labs(title = "Fuel economy on highway vs. city",
x = "City (miles per gallon)",
y = "Highway (miles per gallon)",
caption = "Data from ggplot2::mpg") +
# coord_fixed() +
theme_bw()
sp4
# (C) Grouping (by a categorical variable): ------
# Using facets to avoid overplotting:
sp5 <- ggplot(mpg) +
geom_point(aes(x = cty, y = hwy)) +
geom_abline() +
facet_wrap(~class) +
theme_bw()
sp5
# Grouping by color:
sp6 <- ggplot(mpg) +
geom_point(aes(x = cty, y = hwy, color = class),
position = "jitter", alpha = 1/2, size = 4) +
geom_abline(linetype = 2) +
theme_bw()
sp6
# Grouping by facets:
sp7 <- ggplot(mpg) +
geom_point(aes(x = cty, y = hwy),
position = "jitter", alpha = 1/2, size = 2) +
geom_abline(linetype = 2) +
facet_wrap(~class) +
theme_bw()
sp7
See https://ggplot2.tidyverse.org/reference/ for more examples.
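The list of remedies above mentions mapping frequency to object size, but the examples do not show this. A minimal sketch, assuming that geom_count (which counts the cases at each position) is an acceptable way of doing so; the plot name sp_count and the legend title are made up for illustration:

```r
library(ggplot2)

# Sketch: map the number of overlapping cases to point size.
# geom_count() counts the cases at each (x, y) position and
# maps this count (n) to the size of the point:
sp_count <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_count(color = "steelblue", alpha = 1/2) +
  geom_abline(linetype = 2) +
  labs(size = "Frequency:") +
  theme_bw()
sp_count
```

Unlike jittering, this keeps all points at their true positions, at the cost of an additional legend.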
Note some details:
- ggplot requires data and maps independent variables to dimensions (e.g., the x- and y-axis) and dependent variables to geometric objects (called “geoms”). It typically assumes that the to-be-plotted <DATA> is in a table (data frame or tibble) in long format and contains independent variables as factors.
- The argument names data = and mapping = can be omitted, but an aesthetic mapping aes(<MAPPING>) for at least one geom is needed.
- Different geoms can be combined, but their order matters (as later layers are printed on top of earlier ones).
- When multiple geoms use the same mappings, their common aes(<MAPPING>) can be moved into the initial ggplot call (behind <DATA>).
- In ggplot, a sequence of commands is combined by +, rather than %>%.
- The visual appearance of plots is highly customizable (e.g., by supplying aesthetic arguments, specifying labels and legends, and applying pre-defined themes to plots).
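Two of these notes (common mappings and the order of layers) can be illustrated by a short sketch. The choice of geom_smooth as a second geom and the names p_common and p_local are assumptions made for this example:

```r
library(ggplot2)

# Common mappings set once in ggplot() are inherited by ALL geoms:
p_common <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(alpha = 1/4) +                    # 1st layer (drawn first)
  geom_smooth(method = "lm", formula = y ~ x,  # 2nd layer (drawn on top)
              color = "firebrick")
p_common

# The same plot with the mappings repeated for EACH geom:
p_local <- ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy), alpha = 1/4) +
  geom_smooth(aes(x = cty, y = hwy), method = "lm", formula = y ~ x,
              color = "firebrick")
p_local
```

Swapping the two geoms would draw the points on top of the line, which is why the order of layers matters.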
EDA
Creating good graphs is both an art and a craft. The key to creating good graphs lies in answering 2 sets of questions:
1. Knowing the number and type of variables to be plotted. This includes answering data-related questions like
   - How many variables are there to plot?
   - Are these variables categorical or continuous?
   - Do some variables qualify (e.g., group) the values of others?
2. Knowing the intended type of plot. This includes answering functional questions like
   - What is the purpose of this plot?
   - What are possible plots for this purpose?
   - Which of these would be the most appropriate plot?
Even when the questions of 1. and 2. are answered, creating good graphs with ggplot requires a lot of practice and many hours of trial-and-error experimentation.
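The data-related questions can often be answered by inspecting the data before plotting. A minimal sketch, using the mpg data of ggplot2 (the name var_types is made up for this example):

```r
library(ggplot2)  # provides the mpg data

# How many variables are there, and of which type?
var_types <- vapply(mpg, function(v) class(v)[1], character(1))
var_types         # type of each variable
table(var_types)  # number of variables per type
```

Variables of type character are candidates for categorical variables. Note that an integer variable with only a few distinct values (e.g., cyl) can also be treated as categorical.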
Basic plot types
Histograms
A histogram shows counts of the values of 1 (typically continuous) variable. This is useful for evaluating the distribution of the variable:
library(ggplot2)
# Create data:
tb <- tibble(iq = rnorm(n = 1000, mean = 100, sd = 15))
# Basic histogram:
ggplot(tb) +
geom_histogram(aes(x = iq), binwidth = 5)
# Pimped histogram:
ggplot(tb) +
geom_histogram(aes(x = iq), binwidth = 5,
fill = "gold", color = "black") +
labs(title = "Histogram", x = "IQ values", y = "Frequency in sample (n)",
caption = "[Using random iq data.]") +
theme_classic()
More on histograms:
Scatterplots
A scatterplot shows the relationship between 2 (typically continuous) variables:
# Data:
ir <- as_tibble(iris)
ir
#> # A tibble: 150 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.10 3.50 1.40 0.200 setosa
#> 2 4.90 3.00 1.40 0.200 setosa
#> 3 4.70 3.20 1.30 0.200 setosa
#> 4 4.60 3.10 1.50 0.200 setosa
#> 5 5.00 3.60 1.40 0.200 setosa
#> 6 5.40 3.90 1.70 0.400 setosa
#> 7 4.60 3.40 1.40 0.300 setosa
#> 8 5.00 3.40 1.50 0.200 setosa
#> 9 4.40 2.90 1.40 0.200 setosa
#> 10 4.90 3.10 1.50 0.100 setosa
#> # ... with 140 more rows
# Basic scatterplot:
ggplot(ir) +
geom_point(aes(x = Petal.Length, y = Petal.Width, color = Species, shape = Species))
# Using 3 different facets:
ggplot(ir) +
geom_point(aes(x = Petal.Length, y = Petal.Width, color = Species)) +
facet_wrap(~Species)
# Pimped scatterplot:
ggplot(ir) +
geom_point(aes(x = Petal.Length, y = Petal.Width, fill = Species), pch = 21, color = "black", size = 2, alpha = 1/2) +
facet_wrap(~Species) +
# coord_fixed() +
labs(title = "Scatterplot", x = "Length of petal", y = "Width of petal",
caption = "[Using iris data.]") +
theme_bw() +
theme(legend.position = "none")
More on scatterplots:
Bar plots
Another common type of plot shows values (across different levels of some variable) as the height of bars. As this plot type can use both categorical and continuous variables, it turns out to be surprisingly complex to create good bar charts. To get us started, here are only a few examples:
Counts of cases
By default, geom_bar computes summary statistics of the data. When nothing else is specified, geom_bar counts the number or frequency of values (i.e., stat = "count") and maps this count to the y-axis (i.e., y = ..count..):
library(ggplot2)
## Data:
ggplot2::mpg
#> # A tibble: 234 x 11
#> manufacturer model displ year cyl trans drv cty hwy fl
#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr>
#> 1 audi a4 1.80 1999 4 auto(l… f 18 29 p
#> 2 audi a4 1.80 1999 4 manual… f 21 29 p
#> 3 audi a4 2.00 2008 4 manual… f 20 31 p
#> 4 audi a4 2.00 2008 4 auto(a… f 21 30 p
#> 5 audi a4 2.80 1999 6 auto(l… f 16 26 p
#> 6 audi a4 2.80 1999 6 manual… f 18 26 p
#> 7 audi a4 3.10 2008 6 auto(a… f 18 27 p
#> 8 audi a4 quat… 1.80 1999 4 manual… 4 18 26 p
#> 9 audi a4 quat… 1.80 1999 4 auto(l… 4 16 25 p
#> 10 audi a4 quat… 2.00 2008 4 manual… 4 20 28 p
#> # ... with 224 more rows, and 1 more variable: class <chr>
# (a) Count number of cases by class:
ggplot(mpg) +
geom_bar(aes(x = class))
# (b) is the same as:
ggplot(mpg) +
geom_bar(aes(x = class, y = ..count..))
# (c) is the same as:
ggplot(mpg) +
geom_bar(aes(x = class), stat = "count")
# (d) is the same as:
ggplot(mpg) +
geom_bar(aes(x = class, y = ..count..), stat = "count")
# (e) pimped version:
ggplot(mpg) +
geom_bar(aes(x = class, fill = class),
# stat = "count",
color = "black") +
labs(title = "Counts of cars by class",
x = "Class of car", y = "Frequency") +
scale_fill_brewer(name = "Class:", palette = "Blues") +
theme_bw()
Practice: Plot the number or frequency of cases in the mpg data by cyl (in at least 3 different ways).
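The ..count.. notation used above has been deprecated in more recent versions of ggplot2 (since version 3.4.0) in favor of after_stat(). A sketch of the equivalent call, assuming ggplot2 version 3.3.0 or later is installed (the name bar_count is made up for this example):

```r
library(ggplot2)

# Equivalent to aes(x = class, y = ..count..),
# assuming ggplot2 (>= 3.3.0):
bar_count <- ggplot(mpg) +
  geom_bar(aes(x = class, y = after_stat(count)))
bar_count
```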
Proportion of cases
An alternative to showing the count or frequency of cases is showing the corresponding proportion of cases:
library(ggplot2)
## Data:
ggplot2::mpg
#> # A tibble: 234 x 11
#> manufacturer model displ year cyl trans drv cty hwy fl
#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr>
#> 1 audi a4 1.80 1999 4 auto(l… f 18 29 p
#> 2 audi a4 1.80 1999 4 manual… f 21 29 p
#> 3 audi a4 2.00 2008 4 manual… f 20 31 p
#> 4 audi a4 2.00 2008 4 auto(a… f 21 30 p
#> 5 audi a4 2.80 1999 6 auto(l… f 16 26 p
#> 6 audi a4 2.80 1999 6 manual… f 18 26 p
#> 7 audi a4 3.10 2008 6 auto(a… f 18 27 p
#> 8 audi a4 quat… 1.80 1999 4 manual… 4 18 26 p
#> 9 audi a4 quat… 1.80 1999 4 auto(l… 4 16 25 p
#> 10 audi a4 quat… 2.00 2008 4 manual… 4 20 28 p
#> # ... with 224 more rows, and 1 more variable: class <chr>
# (1) Proportion of cases by class:
ggplot(mpg) +
geom_bar(aes(x = class, y = ..prop.., group = 1))
# is the same as:
ggplot(mpg) +
geom_bar(aes(x = class, y = ..count../sum(..count..)))
Practice: Plot the proportion of cases in the mpg data by cyl (in at least 3 different ways).
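Another way of plotting proportions is to compute them first and then plot the resulting values. A sketch using dplyr, in which the table mpg_prop and its column prop are names introduced only for this example:

```r
library(dplyr)
library(ggplot2)

# Compute counts and proportions of cases by class:
mpg_prop <- mpg %>%
  count(class) %>%           # n: number of cases per class
  mutate(prop = n / sum(n))  # prop: proportion of all cases
mpg_prop

# Plot the pre-computed values
# (geom_col is geom_bar with stat = "identity"):
ggplot(mpg_prop) +
  geom_col(aes(x = class, y = prop))
```

This separates computing values from plotting them, which makes it easy to check the proportions before they are shown.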
Bar plots of existing values
A common difficulty occurs when the table to plot already contains the values to be shown as bars. As there is nothing to be computed in this case, we need to specify stat = "identity" for geom_bar (to override its default of stat = "count").
For instance, let’s plot a bar chart that shows the election data from the following tibble de:
| year | party | share |
|---|---|---|
| 2013 | CDU/CSU | 0.415 |
| 2013 | SPD | 0.257 |
| 2013 | Others | 0.328 |
| 2017 | CDU/CSU | 0.330 |
| 2017 | SPD | 0.205 |
| 2017 | Others | 0.465 |
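The code below uses de without showing how it was created. A sketch of one way to enter the data, assuming that the values of the table above are typed in directly (with year as a character variable and party as a factor, to match the printed tibble):

```r
library(tibble)

# Assumption: de contains the values of the table above
# (year as character, party as factor with levels in this order):
de <- tibble(
  year  = rep(c("2013", "2017"), each = 3),
  party = factor(rep(c("CDU/CSU", "SPD", "Others"), times = 2),
                 levels = c("CDU/CSU", "SPD", "Others")),
  share = c(.415, .257, .328, .330, .205, .465))
de
```

The order of the factor levels determines the order of bars and the assignment of fill colors in the plots.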
- A version with 2 x 3 separate bars (using position = "dodge"):
## Data: -----
de # => 6 x 3 tibble
#> # A tibble: 6 x 3
#> year party share
#> * <chr> <fct> <dbl>
#> 1 2013 CDU/CSU 0.415
#> 2 2013 SPD 0.257
#> 3 2013 Others 0.328
#> 4 2017 CDU/CSU 0.330
#> 5 2017 SPD 0.205
#> 6 2017 Others 0.465
## Note that year is of type character, which could be changed by:
# de$year <- parse_integer(de$year)
## (1) Bar chart with side-by-side bars (dodge): -----
## (a) minimal version:
bp_1 <- ggplot(de, aes(x = year, y = share, fill = party)) +
## (A) 3 bars per election (position = "dodge"):
geom_bar(stat = "identity", position = "dodge", color = "black") # 3 bars next to each other
bp_1
## (b) Version with text labels and customized colors:
bp_1 +
## pimping plot:
geom_text(aes(label = paste0(round(share * 100, 1), "%"), y = share + .01),
position = position_dodge(width = 1),
fontface = 2, color = "black") +
# Some set of high contrast colors:
scale_fill_manual(name = "Party:", values = c("black", "red3", "gold")) +
# Titles and labels:
labs(title = "Partial results of the German general elections 2013 and 2017",
x = "Year of election", y = "Share of votes",
caption = "Data from www.bundeswahlleiter.de.") +
# coord_flip() +
theme_bw()
- A version with 2 bars with 3 segments (using position = "stack"):
## Data: -----
de # => 6 x 3 tibble
#> # A tibble: 6 x 3
#> year party share
#> * <chr> <fct> <dbl>
#> 1 2013 CDU/CSU 0.415
#> 2 2013 SPD 0.257
#> 3 2013 Others 0.328
#> 4 2017 CDU/CSU 0.330
#> 5 2017 SPD 0.205
#> 6 2017 Others 0.465
## (2) Bar chart with stacked bars: -----
## (a) minimal version:
bp_2 <- ggplot(de, aes(x = year, y = share, fill = party)) +
## (B) 1 bar per election (position = "stack"):
geom_bar(stat = "identity", position = "stack") # 1 bar per election
bp_2
## (b) Version with text labels and customized colors:
bp_2 +
## Pimping plot:
geom_text(aes(label = paste0(round(share * 100, 1), "%")),
position = position_stack(vjust = .5),
color = rep(c("black", "white", "white"), 2),
fontface = 2) +
# Some set of high contrast colors:
scale_fill_manual(name = "Party:", values = c("black", "red3", "gold")) +
# Titles and labels:
labs(title = "Partial results of the German general elections 2013 and 2017",
x = "Year of election", y = "Share of votes",
caption = "Data from www.bundeswahlleiter.de.") +
# coord_flip() +
theme_classic()
Bar plots with error bars
It is typically a good idea to add some measure of variability (e.g., the standard deviation, standard error, or confidence interval) to any bar plot. There is an entire range of geoms that draw error bars:
## Create data to plot: -----
n_cat <- 6
set.seed(101)
data <- tibble(
name = LETTERS[1:n_cat],
value = sample(seq(25, 50), n_cat),
sd = rnorm(n = n_cat, mean = 0, sd = 8))
data
#> # A tibble: 6 x 3
#> name value sd
#> <chr> <int> <dbl>
#> 1 A 34 1.71
#> 2 B 26 2.49
#> 3 C 42 9.39
#> 4 D 40 4.95
#> 5 E 30 -0.902
#> 6 F 31 7.34
## Error bars: -----
## x-aesthetic only:
# (a) errorbar:
ggplot(data) +
geom_bar(aes(x = name, y = value), stat = "identity", fill = "steelblue") +
geom_errorbar(aes(x = name, ymin = value - sd, ymax = value + sd),
width = 0.4, color = "orange", alpha = 1, size = 1.0)
# (b) linerange:
ggplot(data) +
geom_bar(aes(x = name, y = value), stat = "identity", fill = "olivedrab3") +
geom_linerange(aes(x = name, ymin = value - sd, ymax = value + sd),
color = "firebrick", alpha = 1, size = 2.5)
## Additional y-aesthetic:
# (c) crossbar:
ggplot(data) +
geom_bar(aes(x = name, y = value), stat = "identity", fill = "tomato4") +
geom_crossbar(aes(x = name, y = value, ymin = value - sd, ymax = value + sd),
width = 0.3, color = "sienna1", alpha = 1, size = 1.0)
# (d) pointrange:
ggplot(data) +
geom_bar(aes(x = name, y = value), stat = "identity", fill = "burlywood4") +
geom_pointrange(aes(x = name, y = value, ymin = value - sd, ymax = value + sd),
color = "gold", alpha = 1.0, size = 1.2)
More on barplots:
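The examples above use random numbers as placeholders for sd (which is why one value is negative, although a standard deviation cannot be). With real data, the measure of variability is typically computed from the raw values. A sketch that plots the mean of hwy by class in the mpg data with error bars of +/- 1 standard error; the table hwy_by_class and its column names are made up for this example:

```r
library(dplyr)
library(ggplot2)

# Compute mean, SD, and standard error of hwy by class:
hwy_by_class <- mpg %>%
  group_by(class) %>%
  summarise(n      = n(),
            mn_hwy = mean(hwy),
            sd_hwy = sd(hwy),
            se_hwy = sd_hwy / sqrt(n))
hwy_by_class

# Bar plot of means with error bars (mean +/- 1 SE):
ggplot(hwy_by_class, aes(x = class, y = mn_hwy)) +
  geom_col(fill = "steelblue") +
  geom_errorbar(aes(ymin = mn_hwy - se_hwy, ymax = mn_hwy + se_hwy),
                width = 0.4) +
  labs(x = "Class of car", y = "Mean hwy (miles per gallon)") +
  theme_bw()
```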
Drawing curves and lines
- adding trendlines
- lines of data (e.g., means)
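As this section is only an outline, here is a hedged sketch of both points, using the mpg data. The choice of geom_smooth (for a linear trendline) and stat_summary (for a line of means) is an assumption about what is intended here, and the argument fun requires ggplot2 version 3.3.0 or later:

```r
library(ggplot2)

# (a) Adding a linear trendline to a scatterplot:
p_trend <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 1/3) +
  geom_smooth(method = "lm", formula = y ~ x, color = "firebrick") +
  theme_bw()
p_trend

# (b) Adding a line that connects the means of hwy for each value of cyl:
p_means <- ggplot(mpg, aes(x = cyl, y = hwy)) +
  geom_point(position = "jitter", alpha = 1/3) +
  stat_summary(fun = mean, geom = "line", color = "steelblue", size = 1) +
  theme_bw()
p_means
```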
Box plots
- show medians, quartiles, distribution, and outliers
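A minimal sketch of a box plot, assuming that geom_boxplot with its default settings is what is meant here (the name bx and the colors are chosen only for illustration):

```r
library(ggplot2)

# Box plots of hwy by class:
# The box shows the median and the 1st and 3rd quartiles;
# points beyond the whiskers are drawn as outliers.
bx <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot(fill = "gold", outlier.color = "firebrick") +
  labs(title = "Fuel economy on highway by class of car",
       x = "Class of car", y = "hwy (miles per gallon)") +
  theme_bw()
bx
```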
Improving plots
Most default plots can be improved by fine-tuning their visual appearance. Popular levers for “pimping” plots include:
- colors: can be set within geoms (variable when inside aes(...), fixed outside) or by choosing or designing specific color scales;
- labels: labs(...) allows setting titles, captions, axis labels, etc.;
- legends: can be (re-)moved or edited;
- themes: can be selected or modified.
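The difference between variable and fixed colors, and the other levers listed above, can be sketched as follows (the plot names p_var and p_fix and all labels are made up for this example):

```r
library(ggplot2)

# (a) Variable color: inside aes(), color is mapped to the values of drv:
p_var <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = drv))
p_var

# (b) Fixed color: outside aes(), all points get the same color:
p_fix <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy), color = "steelblue")
p_fix

# (c) Adding labels, moving the legend, and selecting a theme:
p_var +
  labs(title = "Fuel economy by engine displacement",
       x = "Engine displacement (liters)", y = "hwy (miles per gallon)",
       color = "Drive train:") +
  theme_bw() +
  theme(legend.position = "bottom")
```

Note that theme(legend.position = ...) is added after theme_bw(), as a complete theme would otherwise override this setting.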
More on data visualization
- study vignette("ggplot") and the documentation for ggplot and various geoms (e.g., geom_);
- study https://ggplot2.tidyverse.org/reference/ and its examples;
- see the cheat sheet on data visualization;
- read Chapter 3: Data visualization and Chapter 7: Exploratory data analysis (EDA) and complete their exercises.
Conclusion
All ds4psy essentials:
| Nr. | Topic |
|---|---|
| 1. | Creating and using tibbles |
| 2. | Data transformation |
| 3. | Visualizing data |
[Last update on 2018-07-06 18:09:02 by hn.]
This is different in Sankey diagrams, as shown at https://developers.google.com/chart/interactive/docs/gallery/sankey.